### CAMDEA Tutorial: 
#### *Maps & Geospatial Data with Python. An Introduction*
#### Carlos Hinrichsen

## 1. Import Required Libraries

In [1]:
from sklearn.cluster import AgglomerativeClustering
from sklearn import preprocessing
import pandas as pd
import folium
from folium.plugins import FloatImage

## 2. Applied Introduction to maps (using Folium)

Imagine that we want to see a map centered on Union Station. First, we need its coordinates (obtained, for example, from https://www.google.ca/maps).

In [2]:
#Define coordinates of Union Station
coord_1 = [43.645392415465025, -79.38055543136053]

#Create the map
map_1 = folium.Map(location = coord_1, zoom_start = 18, tiles = "OpenStreetMap")

#Display the map
map_1
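
To get a feeling for what a value of `zoom_start` means, the following sketch estimates the ground resolution of a map at a given zoom level. It assumes the usual Web Mercator tiling with 256-pixel tiles; the helper name `meters_per_pixel` and the constants are illustrative and not part of folium.

```python
import math

EARTH_RADIUS_M = 6378137.0  # WGS 84 equatorial radius used by Web Mercator
TILE_SIZE_PX = 256          # assumption: standard 256-pixel tiles


def meters_per_pixel(latitude_deg, zoom):
    """Approximate ground resolution (meters per pixel) at a zoom level."""
    # Resolution at the equator for zoom 0: Earth's circumference over one tile
    equator_resolution = 2 * math.pi * EARTH_RADIUS_M / TILE_SIZE_PX
    # Each zoom level halves the resolution; higher latitudes shrink it by cos(lat)
    return equator_resolution * math.cos(math.radians(latitude_deg)) / 2 ** zoom


union_station_lat = 43.645392415465025
for zoom in (13, 15, 17):
    print(zoom, round(meters_per_pixel(union_station_lat, zoom), 2))
```

Each additional zoom level halves the number of meters covered by one pixel, so a few levels make a large difference in how much of the city is visible.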

In [3]:
map_1.save("Union_Station_map.html")

What if we need another kind of map representation (or tile)?

In [4]:
#Define coordinates of Union Station
coord_1 = [43.645392415465025, -79.38055543136053]

# Define the tile
Tile = "OpenStreetMap" # Default
# Note: the Stamen tiles below may not be available by name in recent folium releases
#Tile = "Stamen Terrain"
#Tile = "Stamen Toner"
#Tile = "Stamen Watercolor"

#Create the map
map_1 = folium.Map(location = coord_1, zoom_start = 17, tiles = Tile)

#Display the map
map_1

Now, let's see some landmarks and attractions of Toronto (Ripley's Aquarium, CN Tower, Scotiabank Arena, Rogers Center and Billy Bishop Airport). We will use an icon to mark each place (other icons can be found at https://fontawesome.com/v4.7.0/icons/).

In [5]:
# First we have the coordinates
coord = [(43.64269861204067, -79.38595014407835),
           (43.64265328493354, -79.38690442791803),
           (43.643538375675675, -79.37913138624386),
           (43.64185358648617, -79.38912529689352),
           (43.6283858229396, -79.39622581292689)
        ]
# Define color palette
cl = ['green', 'lightblue', 'purple', 'pink', 'lightgray', 'beige', 'orange', 'red', 'blue', 'lightgreen', 'lightred', 'white', 'darkblue', 'cadetblue', 'darkgreen', 'gray', 'black', 'darkpurple', 'darkred']

# Map centered on Ripley's Aquarium
map_2 = folium.Map(location=[43.64269861204067, -79.38595014407835], control_scale=True, zoom_start=15)
# Now we add an icon and color for each location
for (lat, lon), color in zip(coord, cl):
    folium.Marker([lat, lon], icon=folium.Icon(color=color, icon='info-sign')).add_to(map_2)
map_2
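
The map above is centered on the first landmark. A possible alternative, sketched below, is to center the map on the average of all the coordinates so that every marker sits around the middle of the view. The helper `mean_center` is a hypothetical name introduced here for illustration; its result could be passed as `location` to `folium.Map`.

```python
def mean_center(points):
    """Return [mean latitude, mean longitude] of a list of (lat, lon) pairs."""
    lats = [lat for lat, _ in points]
    lons = [lon for _, lon in points]
    return [sum(lats) / len(lats), sum(lons) / len(lons)]


# Same landmark coordinates as in the cell above
coord = [(43.64269861204067, -79.38595014407835),
         (43.64265328493354, -79.38690442791803),
         (43.643538375675675, -79.37913138624386),
         (43.64185358648617, -79.38912529689352),
         (43.6283858229396, -79.39622581292689)]

center = mean_center(coord)
print(center)
```

A simple average is adequate for points that are this close together; for points spread over large distances a different approach would be needed.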

Finally, we will attach each landmark's webpage as a popup (shown when clicking the icon) and its name as a tooltip (shown when hovering over the icon).

In [6]:
web = ['<a href="https://www.ripleyaquariums.com/canada/" target="_blank">"Ripleys Aquarium"</a>',
           '<a href="https://www.cntower.ca/en-ca/home.html" target="_blank">"CN Tower"</a>',
           '<a href="https://www.scotiabankarena.com/" target="_blank">"Scotiabank Arena"</a>',
           '<a href="https://www.mlb.com/bluejays/ballpark/information" target="_blank">"Rogers Center"</a>',
           '<a href="https://www.billybishopairport.com/" target="_blank">"Billy Bishop Airport"</a>'
        ]
Names = ["Ripley's Aquarium", "CN Tower", "Scotiabank Arena", "Rogers Center", "Billy Bishop Airport"]

# Map centered on Ripley's Aquarium
map_3 = folium.Map(location=[43.64269861204067, -79.38595014407835], control_scale=True, zoom_start=15)
# Now we add an icon, color, popup link and tooltip name for each location
for (lat, lon), color, link, name in zip(coord, cl, web, Names):
    folium.Marker([lat, lon], popup=link, tooltip=name, icon=folium.Icon(color=color, icon='info-sign')).add_to(map_3)
map_3
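
The popup strings above are plain HTML anchors written by hand. As a minimal sketch, they could also be generated from a URL and a label, escaping any special characters along the way. The function `make_link` is a hypothetical helper, not part of folium.

```python
from html import escape


def make_link(url, label):
    """Build an HTML anchor that opens `url` in a new browser tab."""
    # escape() protects characters such as &, <, > and quotes in both parts
    return '<a href="{}" target="_blank">{}</a>'.format(
        escape(url, quote=True), escape(label))


print(make_link("https://www.cntower.ca/en-ca/home.html", "CN Tower"))
```

The returned string can be used wherever the hand-written entries of `web` are used, for example as the `popup` argument of `folium.Marker`.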

In [7]:
map_3.save("Toronto.html")

If you want to explore more options regarding different elements in these maps, please visit https://python-visualization.github.io/folium/

## 3. Visualizing Hierarchical Clustering results using maps

#### Data Set

### Real Estate Valuation Data Set 

We will use the Real Estate Valuation data set from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Real+estate+valuation+data+set

### Data Set Information:

1. The house price, `house price of unit area`, is the response variable (Y). The units are $10000$ New Taiwan Dollars per Ping, where Ping is a local unit of area ($1$ Ping = $3.3$ square meters).

On the other hand, we have the following potential predictors:

1. `transaction date`: For example, 2013.250 = 2013 March, 2013.500 = 2013 June, etc.
2. `house age`: Measured in years
3. `distance to the nearest MRT station`: Measured in meters 
4. `number of convenience stores`: The number of convenience stores within walking distance (the "living circle")
5. `latitude`: Measured in degrees
6. `longitude`: Measured in degrees
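
Two of the encodings above are less obvious than the rest: the decimal `transaction date` and the price per Ping. The sketch below decodes both, following the examples given in the list (2013.250 = March 2013, 2013.500 = June 2013) and the stated conversion of 1 Ping = 3.3 square meters. The function names are hypothetical, and the handling of a fraction of `.000` is an assumption rather than something the data set description states.

```python
PING_IN_SQUARE_METERS = 3.3  # conversion stated in the data set description


def decimal_year_to_year_month(value):
    """Convert a decimal year such as 2013.250 into (year, month)."""
    year = int(value)
    month = round((value - year) * 12)
    if month == 0:
        # Assumption: a fraction of .000 denotes December of the previous year
        return year - 1, 12
    return year, month


def price_per_square_meter(price_in_10k_ntd_per_ping):
    """Convert a price in 10000 NTD per Ping into NTD per square meter."""
    return price_in_10k_ntd_per_ping * 10000 / PING_IN_SQUARE_METERS


print(decimal_year_to_year_month(2013.250))
print(decimal_year_to_year_month(2013.500))
print(price_per_square_meter(33.0))  # 33.0 is an illustrative value, not a row of the data
```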

#### Upload the `Real estate valuation data set.xlsx`

In [8]:
data = pd.read_excel('Real estate valuation data set.xlsx', header=[0],sheet_name="Data")

#### Data Cleaning and Transformation

The idea of this exercise is to understand how similar the houses are in terms of `house age`, `distance to the nearest MRT station` and `number of convenience stores`. We will also need `latitude` and `longitude` for plotting purposes:

In [9]:
data.columns

Index(['No', 'X1 transaction date', 'X2 house age',
       'X3 distance to the nearest MRT station',
       'X4 number of convenience stores', 'X5 latitude', 'X6 longitude',
       'Y house price of unit area'],
      dtype='object')

In [10]:
# Getting only the important info
data_f = data.drop(['No','X1 transaction date',
       'Y house price of unit area'], axis=1)

In [11]:
data_f.head()

| | X2 house age | X3 distance to the nearest MRT station | X4 number of convenience stores | X5 latitude | X6 longitude |
|---|---|---|---|---|---|
| 0 | 32.0 | 84.87882 | 10 | 24.98298 | 121.54024 |
| 1 | 19.5 | 306.5947 | 9 | 24.98034 | 121.53951 |
| 2 | 13.3 | 561.9845 | 5 | 24.98746 | 121.54391 |
| 3 | 13.3 | 561.9845 | 5 | 24.98746 | 121.54391 |
| 4 | 5.0 | 390.5684 | 5 | 24.97937 | 121.54245 |


In [12]:
# Renaming the default variable names
data_f.rename(columns={'X2 house age': 'AGE',
                       'X3 distance to the nearest MRT station': 'DST',
                       'X4 number of convenience stores': 'STR',
                       'X5 latitude': 'LAT',
                       'X6 longitude': 'LONG'}, inplace=True)

In [13]:
data_f.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 414 entries, 0 to 413
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   AGE     414 non-null    float64
 1   DST     414 non-null    float64
 2   STR     414 non-null    int64  
 3   LAT     414 non-null    float64
 4   LONG    414 non-null    float64
dtypes: float64(4), int64(1)
memory usage: 16.3 KB


#### Plotting using Folium

In [14]:
# Initial geographic points
lat_i = data_f['LAT'].mean()
long_i = data_f['LONG'].mean()
# Create a Map instance
#m1 = folium.Map(location=[lat_i,long_i],
#    zoom_start=13, control_scale=True,
#               tiles='Stamen Terrain'
#              )

m1 = folium.Map(location=[lat_i,long_i],
    zoom_start=13, control_scale=True,
               )

for i in range(len(data_f['LONG'])):
    folium.Marker([data_f['LAT'][i], data_f['LONG'][i]], tooltip=data_f['AGE'][i]).add_to(m1)
m1

#### Clustering Analysis

The main purpose of this analysis is to understand if houses that have similar characteristics could have similar locations.

We will use the hierarchical clustering algorithm:

## Hierarchical Clustering 

The features are standardized before clustering so that no single variable dominates the Euclidean distances between houses.

For the purpose of this tutorial, we set the number of clusters to 3.

In [15]:
# Coordinates
coord_f = data_f.drop(['AGE','DST','STR'], axis=1)
# Final scaled data
data_f2 = data_f.drop(['LAT','LONG'], axis=1)
colnames = data_f2.columns
data_f3 = preprocessing.scale(data_f2)
data_f3 = pd.DataFrame(data=data_f3, columns= colnames)
data_f3.head()

| | AGE | DST | STR |
|---|---|---|---|
| 0 | 1.255628 | -0.792495 | 2.007407 |
| 1 | 0.157086 | -0.616612 | 1.667503 |
| 2 | -0.387791 | -0.414015 | 0.307885 |
| 3 | -0.387791 | -0.414015 | 0.307885 |
| 4 | -1.117223 | -0.549997 | 0.307885 |
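
The scaling step subtracts the column mean from each value and divides by the column's population standard deviation. A minimal standard-library sketch of that computation follows; `standardize` is a hypothetical helper, and because it is applied here to only the first five `AGE` values, the numbers it prints will not match the table above, which was computed from all 414 rows.

```python
from statistics import mean, pstdev


def standardize(values):
    """Return z-scores: (value - mean) / population standard deviation."""
    center = mean(values)
    spread = pstdev(values)  # population standard deviation (divides by n)
    return [(v - center) / spread for v in values]


ages = [32.0, 19.5, 13.3, 13.3, 5.0]  # first five AGE values shown earlier
scaled = standardize(ages)
print([round(z, 3) for z in scaled])
```

After scaling, each column has mean 0 and standard deviation 1, so `AGE`, `DST` and `STR` contribute on a comparable scale to the distances used by the clustering.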


In [16]:
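# Create the clusters
# Ward linkage always uses Euclidean distances, so no distance argument is needed
# (the older `affinity` argument is not accepted by recent scikit-learn releases)
cluster = AgglomerativeClustering(n_clusters=3, linkage='ward')

Clusters = cluster.fit_predict(data_f3)

# Add the results to the scaled dataset
data_f3['Clusters'] = Clusters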
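
To make the idea behind Ward linkage concrete, here is a naive pure-Python sketch of agglomerative clustering. It starts with every point in its own cluster and repeatedly merges the pair of clusters whose merge increases the within-cluster variance the least. This is an illustration only, not scikit-learn's implementation: it is far slower, the function names are hypothetical, and the toy points are made up.

```python
from itertools import combinations


def centroid(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def ward_cost(a, b):
    """Increase in within-cluster sum of squares caused by merging a and b."""
    ca, cb = centroid(a), centroid(b)
    squared_distance = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * squared_distance


def ward_clustering(points, n_clusters):
    """Merge clusters greedily until n_clusters remain; return one label per point."""
    clusters = [[i] for i in range(len(points))]  # each cluster holds point indices
    while len(clusters) > n_clusters:
        # Pick the pair of clusters that is cheapest to merge
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: ward_cost([points[k] for k in clusters[pair[0]]],
                                       [points[k] for k in clusters[pair[1]]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because i < j
    labels = [None] * len(points)
    for label, members in enumerate(clusters):
        for k in members:
            labels[k] = label
    return labels


# Made-up toy data with three visually obvious groups
toy_points = [(0, 0), (0, 1), (10, 10), (10, 11), (20, 0), (21, 0)]
toy_labels = ward_clustering(toy_points, n_clusters=3)
print(toy_labels)
```

The label numbers themselves carry no meaning; only which points share a label matters. The same holds for the labels returned by `fit_predict`.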

In [17]:
# Information
data_f3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 414 entries, 0 to 413
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   AGE       414 non-null    float64
 1   DST       414 non-null    float64
 2   STR       414 non-null    float64
 3   Clusters  414 non-null    int64  
dtypes: float64(3), int64(1)
memory usage: 13.1 KB


In [18]:
data_f3.head(10)

| | AGE | DST | STR | Clusters |
|---|---|---|---|---|
| 0 | 1.255628 | -0.792495 | 2.007407 | 1 |
| 1 | 0.157086 | -0.616612 | 1.667503 | 2 |
| 2 | -0.387791 | -0.414015 | 0.307885 | 2 |
| 3 | -0.387791 | -0.414015 | 0.307885 | 2 |
| 4 | -1.117223 | -0.549997 | 0.307885 | 2 |
| 5 | -0.932668 | 0.865586 | -0.371925 | 0 |
| 6 | 1.475337 | -0.365237 | 0.987694 | 1 |
| 7 | 0.227393 | -0.631678 | 0.647789 | 2 |
| 8 | 1.229263 | 3.512777 | -1.051734 | 0 |
| 9 | 0.016473 | 0.554738 | -0.371925 | 0 |


Now let's plot the clusters on a map using `folium`, which builds on the Leaflet library.

Hovering over a house shows its cluster label.

Additionally, each house is coloured by its cluster:
* Cluster 0: Red
* Cluster 1: Blue
* Cluster 2: Orange

In [19]:
# Create a Map instance
m2 = folium.Map(location=[lat_i,long_i],
    zoom_start=13, control_scale=True)
Cluster = 'Clusters'
for i in range(len(coord_f['LONG'])):
    clust = data_f3[Cluster][i]
    if clust == 0:
        color = 'red'
    elif clust == 1:
        color = 'blue'
    else:
        color = 'orange'
    folium.Marker([coord_f['LAT'][i], coord_f['LONG'][i]], tooltip=data_f3[Cluster][i], icon=folium.Icon(color=color, icon='info-sign')).add_to(m2)

m2
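
The `if`/`elif` chain in the loop above maps labels 0, 1 and 2 to colors. One possible alternative, sketched here, is a dictionary lookup with a fallback color, which is easier to extend if the number of clusters changes. The names `CLUSTER_COLORS` and `color_for` and the fallback `'gray'` are assumptions made for this example; the sample labels are the first ten values of the `Clusters` column shown earlier.

```python
CLUSTER_COLORS = {0: 'red', 1: 'blue', 2: 'orange'}


def color_for(cluster_label, default='gray'):
    """Return the marker color for a cluster label, or a fallback color."""
    # int() also accepts NumPy integer labels such as those in a DataFrame column
    return CLUSTER_COLORS.get(int(cluster_label), default)


first_labels = [1, 2, 2, 2, 2, 0, 1, 2, 0, 0]
print([color_for(label) for label in first_labels])
```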

In [20]:
m2.save("Cluster.html")