# Similarity Analysis of Neighbourhoods

## 1. Problem Description

The aim of this workflow is to find areas in different cities that are similar in terms of facilities and venues. Two areas are deemed similar if they have similar categories of venues within a similar distance. Similar locations are grouped into the same cluster, and the clusters are then visualized on a map as coloured points, with each colour representing one cluster. The workflow should benefit people who are moving to a new city but would like the comforts of their old neighbourhood to make the adjustment easier. Knowing which locations are similar in terms of venues will also help those expanding a business into new zones, since they can use data from the areas where the business was most profitable to decide where to expand.

## 2. Data Used

The postcode and neighbourhood data may have to be gathered from various online sources, for example Wikipedia or government websites that provide basic location data. Latitude and longitude data and nearby venues are retrieved from the Foursquare API by making calls with the required parameters. The data then has to be cleaned and transformed into features that can be fed into a machine learning algorithm. The postcode and location data do not come in a specific format and may need to be parsed from HTML or other sources. For example, an online table:



| Location | Pincode | State | District |
|---|---|---|---|
| AI staff colony | 400029 | Maharashtra | Mumbai |
| Aareymilk Colony | 400065 | Maharashtra | Mumbai |

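A table like the one above can be turned into rows with only the standard library. This is a minimal sketch rather than the notebook's own method (which relies on `pd.read_html` further down); the `TableParser` class and the inline HTML fragment are assumptions made for illustration.

```python
from html.parser import HTMLParser

class TableParser(HTMLParser):
    """Collects the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []         # finished rows
        self._row = None       # row currently being read, or None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()

# Hypothetical HTML fragment mirroring the sample table
html = """
<table>
<tr><th>Location</th><th>Pincode</th><th>State</th><th>District</th></tr>
<tr><td>AI staff colony</td><td>400029</td><td>Maharashtra</td><td>Mumbai</td></tr>
<tr><td>Aareymilk Colony</td><td>400065</td><td>Maharashtra</td><td>Mumbai</td></tr>
</table>
"""

parser = TableParser()
parser.feed(html)
header, *records = parser.rows

# Unique pincodes, as integers
pincodes = sorted({int(rec[header.index("Pincode")]) for rec in records})
print(pincodes)
```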
The Foursquare data is retrieved as JSON strings from API calls.
For example:
```json
{
    "name": "Harry's Italian Pizza Bar",
    "location": {
        "address": "225 Murray St",
        "lat": 40.71521779064671,
        "lng": -74.01473940209351,
        "labeledLatLngs": [{
            "label": "display",
            "lat": 40.71521779064671,
            "lng": -74.01473940209351
        }],
        "distance": 58,
        "postalCode": "10282"
    }
}
```
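As a rough illustration of how such a record might be consumed, the sketch below parses a trimmed copy of the sample with the standard `json` module and keeps only a few fields. The `record` layout and its key names are assumptions chosen to match the column names used later in the notebook.

```python
import json

# Trimmed copy of the sample venue record
raw = """
{
    "name": "Harry's Italian Pizza Bar",
    "location": {
        "address": "225 Murray St",
        "lat": 40.71521779064671,
        "lng": -74.01473940209351,
        "distance": 58,
        "postalCode": "10282"
    }
}
"""

venue = json.loads(raw)
location = venue["location"]

# Keep only the fields the later steps need
record = {
    "name": venue["name"],
    "v_lat": location["lat"],
    "v_lng": location["lng"],
    "distance": location.get("distance"),  # .get() tolerates a missing key
}
print(record)
```

Using `.get()` for optional fields avoids a `KeyError` when a response omits them.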

## 3. Methodology

### 0. Import Libraries

In [1]:
import numpy as np
import pandas as pd
import folium
import json
import time
import requests as req
import geopy.geocoders
from geopy.geocoders import Geolake
from geopy.extra.rate_limiter import RateLimiter
from pandas import json_normalize  #Top-level since pandas 1.0; pandas.io.json.json_normalize is deprecated
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

#Geolake parameters
geopy.geocoders.options.default_timeout = 50000
geolocator = Geolake(api_key='ZasbJHf0RtU0JpXS2IlG', timeout=50000)

### 1. Collecting and Preprocessing the Data

#### User Inputs

In [2]:
#Can be user input
#Hard-coded for simplicity
CITIES = [
    {"name": "Chennai", "state": "Tamil Nadu"},
    {"name": "Kolkata", "state": "West Bengal"}
    #{"name": "Mumbai", "state": "Maharashtra"},
]
COUNTRY = "India"

#### Utility functions

In [3]:
def get_pinlatlng(city_name, city_state, COUNTRY, geolocator=None, from_csv=True):
    df = None
    csv_name = 'tests_and_data/' + city_name + '_pinlatlng.csv'
    if from_csv:
        #Get lat-lng of pincodes from previously collected data in CSV files
        df = pd.read_csv(csv_name, header=0)
    else: 
        #Get required data from the INTERNET and GEOCODING SERVICES
        
        #(1) Format city name and state and GET entire page
        city_name = city_name.lower().replace(' ', '-')
        city_state = city_state.lower().replace(' ', '-')
        page_html = req.get("https://www.mapsofindia.com/pincode/india/{}/{}".format(city_state, city_name)).text

        #(2) Get required table of pins for the city
        ts = page_html.find("<table><tr>")
        te = page_html.find("</table>", ts) + 8
        table_html = page_html[ts:te]

        #(3) Convert to dataframe from html
        df = pd.read_html(table_html, header=[1], skiprows=0)[0]
        df = df.groupby('Pincode').count().reset_index().drop(columns=['Location', 'State', 'District'])

        #(4) Add and initialize new columns, pc_lat and pc_lng 
        df['pc_lat'] = 0.0
        df['pc_lng'] = 0.0

        #(5) Get lat-lng for each pincode
        print('Fetching coordinate data...')
        for i, pin in zip(df.index, df['Pincode']):
            pc_loc = geolocator.geocode({'zipcode': pin, 'city': city_name, 'country': COUNTRY})
            time.sleep(1)
            if pc_loc is not None:
                df.loc[i, 'pc_lat'] = pc_loc.latitude
                df.loc[i, 'pc_lng'] = pc_loc.longitude
                print(i, pin, pc_loc.latitude, pc_loc.longitude, city_name, city_state, COUNTRY)
        print('Fetching coordinate data COMPLETED!')
        
        #(6) Keep only pins for which unique coordinates have been retrieved
        df.drop_duplicates(subset=['pc_lat', 'pc_lng'], keep='first', inplace=True)
        
        #(7) Save to CSV file to prevent from making unnecessary API calls
        df.rename(columns={'Pincode': 'pincode'}, inplace=True)
        df.to_csv(csv_name, index=False)  #csv_name already starts with 'tests_and_data/'
        
    #Display details for sanity check
    print("Read Dataframe:")
    display(df.head())
    print("Rows: ", df.shape[0], "\nCols: ", df.shape[1])
    print("Data Types:\n", df.dtypes)

    return df
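The `from_csv` switch in `get_pinlatlng` is a load-or-fetch cache: read a saved CSV when one exists, otherwise do the expensive work once and save it. The sketch below shows that pattern in isolation with the standard library; `load_or_fetch` and `fake_fetch` are hypothetical names, and the single row is sample data standing in for the scraping and geocoding step.

```python
import csv
import os
import tempfile

def load_or_fetch(csv_path, fetch_rows, fieldnames):
    """Return rows from csv_path if it exists; otherwise call fetch_rows(),
    save the result to csv_path, and return it."""
    if os.path.exists(csv_path):
        with open(csv_path, newline="") as f:
            return list(csv.DictReader(f))
    rows = fetch_rows()
    with open(csv_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return rows

calls = []

def fake_fetch():
    # Stand-in for the scraping + geocoding step (sample values only)
    calls.append(1)
    return [{"pincode": "600001", "pc_lat": "13.093", "pc_lng": "80.2882"}]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "Chennai_pinlatlng.csv")
    fields = ["pincode", "pc_lat", "pc_lng"]
    first = load_or_fetch(path, fake_fetch, fields)   # fetches and writes the file
    second = load_or_fetch(path, fake_fetch, fields)  # reads the cached file

print(len(calls), first == second)
```

Deciding on file existence rather than a flag means a missing cache file triggers a fetch instead of an error; whether that is preferable depends on how strictly API calls need to be controlled.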

#### Main script

##### (1) Get coordinates and pincodes of CITIES

In [4]:
def get_pin_dfs(CITIES): #(1) Get coordinates and pincodes of CITIES
    for CITY in CITIES:

        #(1) Get Coordinates from Address using geolocator
        address = "{}, {}".format(CITY["name"], COUNTRY)
        CITY_loc = geolocator.geocode(address)
        CITY['lat'] = CITY_loc.latitude
        CITY['lng'] = CITY_loc.longitude
        print('The geographical coordinates of {} are {}, {}.'.format(address, CITY['lat'], CITY['lng']))

        #(2) Get pincodes within cities by scraping www.mapsofindia.com
        CITY['pin_df'] = get_pinlatlng(CITY["name"], CITY["state"], COUNTRY, geolocator, from_csv=True)

        print("========================================================================")
        
get_pin_dfs(CITIES)

The geographical coordinates of Chennai, India are 13.08784, 80.27847.
Read Dataframe:


|  | pincode | pc_lat | pc_lng |
|---|---|---|---|
| 0 | 600001 | 13.093 | 80.2882 |
| 1 | 600002 | 13.0744 | 80.2714 |
| 2 | 600003 | 13.0819 | 80.2781 |
| 3 | 600004 | 13.0292 | 80.2708 |
| 4 | 600005 | 13.0572 | 80.2778 |


Rows:  44 
Cols:  3
Data Types:
 pincode      int64
pc_lat     float64
pc_lng     float64
dtype: object
The geographical coordinates of Kolkata, India are 22.56263, 88.36304.
Read Dataframe:


|  | pincode | pc_lat | pc_lng |
|---|---|---|---|
| 0 | 700001 | 22.5947 | 88.3645 |
| 1 | 700007 | 22.5667 | 88.35 |
| 2 | 700010 | 22.7874 | 88.256 |
| 3 | 700014 | 22.1745 | 88.4933 |
| 4 | 700015 | 22.55 | 88.3833 |


Rows:  29 
Cols:  3
Data Types:
 pincode      int64
pc_lat     float64
pc_lng     float64
dtype: object


##### (2) Plotting pincodes on a folium map for sanity check


In [5]:
#(2) Plotting pincodes on a folium map for sanity check
for CITY in CITIES:
    CITY_map = folium.Map(location=[CITY['lat'], CITY['lng']], zoom_start=12)
    pin_df = CITY['pin_df']
    for pc, pc_lat, pc_lng in zip(pin_df['pincode'], pin_df['pc_lat'], pin_df['pc_lng']):
        label = folium.Popup('PIN: '+ str(pc), parse_html=True)
        folium.CircleMarker(
            [pc_lat, pc_lng],
            radius=5,
            popup=label,
            color='blue',
            fill=True,
            fill_color='red',
            fill_opacity=0.7).add_to(CITY_map)
    print("\n\n===============================\nMap of ", CITY["name"], "\n===============================")
    display(CITY_map)



[Map of Chennai: folium map with one circle marker per pincode]

[Map of Kolkata: folium map with one circle marker per pincode]


##### (3) Retrieving Venues (specifically, venue categories) for each pin using Foursquare queries

###### 1. Feature Extraction Utilities

In [6]:
#Utility functions that return the required constants
def get_connection_credentials():
    conn = {
        'ID':'KXNX20EHFY1PJIVOLLCOAHUNCBXL0LYUMU1QXZX2T1SNYBCR',
        'SECRET': 'Y03142KYAYRBHHJG3NMXHH2OHP301WDOPMOJIWHMFED03534',
        'VERSION':'20180605'
    }
    return conn
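Keeping credentials in the notebook source exposes them to anyone who can read the file. One common alternative, sketched here under the assumption that the secrets are available as environment variables, builds the same dictionary without hard-coding them. The function name and the variable names `FOURSQUARE_CLIENT_ID`, `FOURSQUARE_CLIENT_SECRET` and `FOURSQUARE_VERSION` are hypothetical.

```python
import os

def get_connection_credentials_from_env(env=os.environ):
    """Build the same dict as get_connection_credentials(), reading the
    secrets from environment variables instead of the source file."""
    required = ("FOURSQUARE_CLIENT_ID", "FOURSQUARE_CLIENT_SECRET")
    missing = [name for name in required if name not in env]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {
        "ID": env["FOURSQUARE_CLIENT_ID"],
        "SECRET": env["FOURSQUARE_CLIENT_SECRET"],
        "VERSION": env.get("FOURSQUARE_VERSION", "20180605"),
    }

# Demonstration with a plain dict standing in for os.environ
demo_env = {"FOURSQUARE_CLIENT_ID": "demo-id", "FOURSQUARE_CLIENT_SECRET": "demo-secret"}
conn = get_connection_credentials_from_env(demo_env)
print(conn["VERSION"])
```

Because the returned keys match `'ID'`, `'SECRET'` and `'VERSION'`, such a function could be passed to `get_venues` in place of the hard-coded one.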

In [7]:
def get_url():
    url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'
    return url
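The positional `{}` template works, but the order of the arguments has to be kept in sync by hand. As an alternative sketch, the query string can be assembled from named parameters with `urllib.parse.urlencode`; `build_explore_url` and `demo_conn` are hypothetical names, and the endpoint and parameter names are taken from the template above.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE_URL = "https://api.foursquare.com/v2/venues/explore"

def build_explore_url(conn, lat, lng, radius, limit):
    """Assemble the explore URL from named query parameters."""
    params = {
        "client_id": conn["ID"],
        "client_secret": conn["SECRET"],
        "v": conn["VERSION"],
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    # urlencode percent-encodes values (the comma in 'll' becomes %2C)
    return BASE_URL + "?" + urlencode(params)

demo_conn = {"ID": "demo-id", "SECRET": "demo-secret", "VERSION": "20180605"}
url = build_explore_url(demo_conn, 13.093, 80.2882, 500, 10)

# Round-trip the query string to check what a server would decode
query = parse_qs(urlsplit(url).query)
print(query["ll"], query["radius"])
```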

In [8]:
def get_category_name(row):
    return row['category'][0]['name']

###### 2. Getting top venues for a Postcode
<!-- Author: Aritra Koley -->

In [9]:
def get_venues(con, lat, lng, radius, limit):
    #(1) Initializing arguments required to call Foursquare API
    CLIENT_ID = con['ID']
    CLIENT_SECRET = con['SECRET'] 
    VERSION = con['VERSION']
    LAT = lat
    LNG = lng 
    RADIUS = radius 
    LIMIT = limit
    
    #(2) Formatting the url with above arguments
    URL = get_url().format(
        CLIENT_ID, 
        CLIENT_SECRET, 
        VERSION, 
        LAT, 
        LNG, 
        RADIUS, 
        LIMIT
    )
    
    #(3) Getting top venues as json
    res_venues = req.get(URL).json()
#     js = json.dumps(res_venues)
#     print("JSON String:", js)
    
    venues_df = None
    if res_venues['response'].get('totalResults', 0) > 0:
        #(4) Retrieving only required bits from response json
        top_venues = res_venues['response']['groups'][0]['items']

        #(5) Converting json dict to pandas dataframe [flattening the json]
        venues_df = json_normalize(top_venues)
        #display(venues_df)


        #(6) Cleaning the df, i.e. filtering and renaming columns 
        required_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
        renamed_columns = ['name', 'category', 'v_lat', 'v_lng']
        venues_df = venues_df[required_columns].copy()  #Copy so the column edits below act on an independent frame
        venues_df.columns = renamed_columns

        #(7) Retrieving categories name and fixing category column
        venues_df.loc[:, 'category'] = venues_df.apply(get_category_name, axis=1)
    
    return venues_df
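`get_category_name` indexes `[0]` into the category list, which raises an `IndexError` if a venue has no categories. The sketch below shows one way the same flattening could be written defensively in plain Python. The helper names, the `"Unknown"` fallback label and the second sample item are assumptions; the items only imitate the shape of `response['groups'][0]['items']`.

```python
def first_category_name(categories, default="Unknown"):
    """Return the name of the first category, or `default` if the list is empty."""
    if categories:
        return categories[0].get("name", default)
    return default

def flatten_items(items):
    """Reduce explore-style 'items' to the four fields used downstream."""
    rows = []
    for item in items:
        venue = item["venue"]
        rows.append({
            "name": venue["name"],
            "category": first_category_name(venue.get("categories", [])),
            "v_lat": venue["location"]["lat"],
            "v_lng": venue["location"]["lng"],
        })
    return rows

# Hand-written items; the second one deliberately has no category
items = [
    {"venue": {"name": "Burma Bazaar", "categories": [{"name": "Video Store"}],
               "location": {"lat": 13.088904, "lng": 80.289918}}},
    {"venue": {"name": "Unnamed stall", "categories": [],
               "location": {"lat": 13.09, "lng": 80.29}}},
]
rows = flatten_items(items)
print([row["category"] for row in rows])
```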

###### 3. Putting together all top venues for every postcode into one dataframe

In [10]:
def get_all_venues(CITY, radius=500, limit=10, show_details=False):
    c = 0  #Running total of venues found across pincodes
    pc_df = CITY['pin_df']
    all_venues_df = pd.DataFrame(columns=['pincode', 'pc_lat', 'pc_lng', 'name', 'category', 'v_lat', 'v_lng'])
    for pc, pc_lat, pc_lng in zip(pc_df['pincode'], pc_df['pc_lat'], pc_df['pc_lng']):
                
        #(1) Get required details of top venues from Foursquare into a dataframe
        conn = get_connection_credentials()
        venues_df = get_venues(conn, pc_lat, pc_lng, radius, limit)
        
        if(venues_df is not None):
            #(2) Add pc, pc_lat and pc_lng values to all entries in venues_df
            venues_df.loc[:, 'pincode'] = pc
            venues_df.loc[:, 'pc_lat'] = pc_lat
            venues_df.loc[:, 'pc_lng'] = pc_lng

            #(3) Rearrange columns
            new_column_arrangement = list(venues_df.columns[4:]) + list(venues_df.columns[:4])
            venues_df = venues_df[new_column_arrangement]

            #(4) Incrementally combine into one dataframe listing all venues
            all_venues_df = pd.concat([all_venues_df, venues_df], ignore_index=True)  #DataFrame.append was removed in pandas 2.0

            if show_details:
                details = "Processing Pincode: {}\nLocation: ({}, {})"
                print(details.format(pc, pc_lat, pc_lng))
                c = c + venues_df.shape[0]
                print("Venues Found: ",  c)
                print("\nVenues Dataframe:")
                display(venues_df)
                print('============================================================================================')

    all_venues_df['city'] = CITY['name']

    print("\n\n===================\nTotal Venues Found: ", all_venues_df.shape[0])
    print("Unique Categories Retrieved:", all_venues_df.loc[:, 'category'].unique().shape[0])
    print("Unique Categories Retrieved:", all_venues_df.loc[:, 'category'].unique())
    print("===================\n\nAll Venues DF:")
    display(all_venues_df.head())
    
    return all_venues_df

for CITY in CITIES:
    CITY['all_venues_df'] = get_all_venues(CITY, show_details=False)



Total Venues Found:  302
Unique Categories Retrieved: 86
Unique Categories Retrieved: ['Video Store' 'Asian Restaurant' 'Fast Food Restaurant' 'Bus Station'
 'General Travel' 'Bookstore' 'Indian Restaurant' 'Sandwich Place'
 'Vegetarian / Vegan Restaurant' 'Train Station' 'Hotel' 'Juice Bar'
 'Flea Market' 'Beach' 'Platform' 'Middle Eastern Restaurant' 'Train'
 'Soccer Stadium' 'Food' 'Thai Restaurant' 'Coffee Shop' 'Museum'
 'Kebab Restaurant' 'Bakery' 'Frozen Yogurt Shop' 'Daycare' 'Café'
 'Ice Cream Shop' 'Gym / Fitness Center' 'Shopping Mall' 'Restaurant'
 'Snack Place' 'Pizza Place' 'Italian Restaurant' 'Diner' 'Multiplex'
 'Dessert Shop' 'Pier' 'Park' 'Department Store' 'Smoke Shop'
 'Arts & Crafts Store' 'Record Shop' 'Kerala Restaurant'
 'Chinese Restaurant' 'African Restaurant' 'Garden' 'Tea Room' 'Lounge'
 'Fruit & Vegetable Store' 'Fried Chicken Joint' 'BBQ Joint' 'Spa'
 'Boutique' 'Indie Movie Theater' 'Amphitheater' 'Clothing Store'
 'South Indian Restaurant' 'Arcade' 'C

|  | pincode | pc_lat | pc_lng | name | category | v_lat | v_lng | city |
|---|---|---|---|---|---|---|---|---|
| 0 | 600001 | 13.093 | 80.2882 | Burma Bazaar | Video Store | 13.088904 | 80.289918 | Chennai |
| 1 | 600001 | 13.093 | 80.2882 | Murugan Idli Shop | Asian Restaurant | 13.088824 | 80.287842 | Chennai |
| 2 | 600001 | 13.093 | 80.2882 | Burma Food | Fast Food Restaurant | 13.092284 | 80.29087 | Chennai |
| 3 | 600001 | 13.093 | 80.2882 | Broadway | Bus Station | 13.090351 | 80.285716 | Chennai |
| 4 | 600002 | 13.0744 | 80.2714 | Chintatri Pet MRTS | General Travel | 13.07301 | 80.274172 | Chennai |




Total Venues Found:  60
Unique Categories Retrieved: 41
Unique Categories Retrieved: ['Indian Sweet Shop' 'Market' 'Fast Food Restaurant' 'Indian Restaurant'
 'Hotel' 'Neighborhood' 'Mughlai Restaurant' 'Bengali Restaurant' 'Bakery'
 'Park' 'Restaurant' 'Chinese Restaurant' 'Plaza' 'Pizza Place' 'Dhaba'
 'Flea Market' 'Food' 'Grocery Store' 'Snack Place' 'Train Station'
 'Dumpling Restaurant' 'Bookstore' 'Shopping Mall' 'Coffee Shop'
 'Department Store' 'Clothing Store' 'Multiplex' 'Bus Station'
 'Jewelry Store' 'Gym Pool' 'Metro Station' 'Dessert Shop' 'Café'
 'Italian Restaurant' 'Garden' 'Convenience Store' 'Food Truck' 'ATM'
 'Bar' 'Electronics Store' 'Gun Range']

All Venues DF:


|  | pincode | pc_lat | pc_lng | name | category | v_lat | v_lng | city |
|---|---|---|---|---|---|---|---|---|
| 0 | 700001 | 22.5947 | 88.3645 | Girish Chandra Dey & Nakur Chandra Nandy | Indian Sweet Shop | 22.59604 | 88.367485 | Kolkata |
| 1 | 700001 | 22.5947 | 88.3645 | Hatibagan Market | Market | 22.595477 | 88.366662 | Kolkata |
| 2 | 700001 | 22.5947 | 88.3645 | Mitra Café | Fast Food Restaurant | 22.595925 | 88.363917 | Kolkata |
| 3 | 700001 | 22.5947 | 88.3645 | Allen's Kitchen | Indian Restaurant | 22.593607 | 88.364689 | Kolkata |
| 4 | 700001 | 22.5947 | 88.3645 | Chittaranjan Mistanna Bhandar | Indian Sweet Shop | 22.598481 | 88.366942 | Kolkata |


In [11]:
for CITY in CITIES:
    print(CITY['all_venues_df'].shape)
    CITY['all_venues_df'].to_csv(CITY['name'] + '_all_venues.csv', index=False)

(302, 8)
(60, 8)


In [12]:
#Concatenating all CITY['all_venues_df']s into one large CITIES_venue_df
#pd.concat is used because DataFrame.append was removed in pandas 2.0
CITIES_venues_df = pd.concat([CITY['all_venues_df'] for CITY in CITIES], ignore_index=True)

CITIES_venues_df.shape

(362, 8)

##### (4) Building the feature vector dataframe 
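Each postcode is described by a feature vector holding the mean of every one-hot encoded venue category, i.e. the share of that postcode's venues that fall into each category.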

In [13]:
def one_hot(CITIES_venues_df):
    #(1) One-hot encode the 'category' column and add the pincode, coordinate and city columns to new df
    onehot_df = pd.get_dummies(CITIES_venues_df[['category']], prefix="", prefix_sep="")
    onehot_df.insert(0, 'pincode', CITIES_venues_df.loc[:, 'pincode'])
    onehot_df.insert(1, 'pc_lat', CITIES_venues_df.loc[:, 'pc_lat'])
    onehot_df.insert(2, 'pc_lng', CITIES_venues_df.loc[:, 'pc_lng'])
    onehot_df.insert(3, 'city', CITIES_venues_df.loc[:, 'city'])
    return onehot_df
    
def get_features(onehot_df):
    #(2) Group-by 'pincode' and aggregate using 'mean'
    #This is the feature vector df containing features of every 'pincode'
    #The text column 'city' is dropped first because it cannot be averaged
    features_df = onehot_df.drop(columns=['city']).groupby('pincode').mean().reset_index()
    print("Onehot_DF:", onehot_df.shape)
    display(onehot_df.head())
    print("Features_df:", features_df.shape)
    display(features_df.head())
    return features_df
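To make the feature construction concrete, the sketch below applies the same two steps (one-hot encoding, then a per-pincode mean) to a tiny frame. The four rows for pincode 600001 are the venues listed in the Chennai output above; pincode 999999 and its rows are made up for contrast.

```python
import pandas as pd

# Four real rows for 600001 plus two hypothetical rows for a made-up pincode
venues = pd.DataFrame({
    "pincode": [600001, 600001, 600001, 600001, 999999, 999999],
    "category": ["Video Store", "Asian Restaurant", "Fast Food Restaurant",
                 "Bus Station", "Bus Station", "Bus Station"],
})

# One indicator column per category
onehot = pd.get_dummies(venues["category"], dtype=float)
onehot.insert(0, "pincode", venues["pincode"])

# Mean of 0/1 indicators per pincode = share of that pincode's venues in each category
features = onehot.groupby("pincode").mean()
print(features)
```

With four venues in four different categories, every category share for 600001 is 0.25, which matches the `Asian Restaurant` and `Video Store` values shown for that pincode in the feature table. Each row sums to 1, so the vectors describe the mix of venues rather than their number.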


In [14]:
#Use one_hot on entire data of both cities to get the right number of columns
onehot_df = one_hot(CITIES_venues_df)

#Break the one_hotted df into 2 dfs based on cities and then 
#Cluster on one city
#and use those clusters to classify the other city
onehot_df0 = onehot_df[onehot_df['city'] == CITIES[0]['name']]
onehot_df1 = onehot_df[onehot_df['city'] == CITIES[1]['name']]

feature_df0 = get_features(onehot_df0)
feature_df1 = get_features(onehot_df1)

Onehot_DF: (302, 101)


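|  | pincode | pc_lat | pc_lng | city | ATM | African Restaurant | Airport Lounge | Amphitheater | Arcade | Arts & Crafts Store | ... | Spa | Sports Bar | Tea Room | Thai Restaurant | Theater | Train | Train Station | Vegetarian / Vegan Restaurant | Video Store | Women's Store |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 600001 | 13.093 | 80.2882 | Chennai | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 600001 | 13.093 | 80.2882 | Chennai | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 600001 | 13.093 | 80.2882 | Chennai | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 600001 | 13.093 | 80.2882 | Chennai | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 600002 | 13.0744 | 80.2714 | Chennai | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |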


Features_df: (43, 100)


|  | pincode | pc_lat | pc_lng | ATM | African Restaurant | Airport Lounge | Amphitheater | Arcade | Arts & Crafts Store | Asian Restaurant | ... | Spa | Sports Bar | Tea Room | Thai Restaurant | Theater | Train | Train Station | Vegetarian / Vegan Restaurant | Video Store | Women's Store |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 600001.0 | 13.093 | 80.2882 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.25 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.25 | 0.0 |
| 1 | 600002.0 | 13.0744 | 80.2714 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 600003.0 | 13.0819 | 80.2781 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.2 | 0.1 | 0.0 | 0.0 |
| 3 | 600004.0 | 13.0292 | 80.2708 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.25 | 0.0 | 0.0 |
| 4 | 600005.0 | 13.0572 | 80.2778 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.111111 | 0.222222 | 0.0 | 0.0 |


Onehot_DF: (60, 101)


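|  | pincode | pc_lat | pc_lng | city | ATM | African Restaurant | Airport Lounge | Amphitheater | Arcade | Arts & Crafts Store | ... | Spa | Sports Bar | Tea Room | Thai Restaurant | Theater | Train | Train Station | Vegetarian / Vegan Restaurant | Video Store | Women's Store |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 302 | 700001 | 22.5947 | 88.3645 | Kolkata | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 303 | 700001 | 22.5947 | 88.3645 | Kolkata | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 304 | 700001 | 22.5947 | 88.3645 | Kolkata | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 305 | 700001 | 22.5947 | 88.3645 | Kolkata | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 306 | 700001 | 22.5947 | 88.3645 | Kolkata | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |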


Features_df: (17, 100)


|  | pincode | pc_lat | pc_lng | ATM | African Restaurant | Airport Lounge | Amphitheater | Arcade | Arts & Crafts Store | Asian Restaurant | ... | Spa | Sports Bar | Tea Room | Thai Restaurant | Theater | Train | Train Station | Vegetarian / Vegan Restaurant | Video Store | Women's Store |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 700001.0 | 22.5947 | 88.3645 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 700007.0 | 22.5667 | 88.35 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 700015.0 | 22.55 | 88.3833 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 700019.0 | 22.529 | 88.368 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 700022.0 | 22.55 | 88.3333 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


### 2. Clustering on the feature set (CITY 1)

#### 1. Generating the clusters

In [15]:
def get_clusters(features_df, k):
        
    #(1) Cluster using only the features and not the 'pincode', 'pc_lat' and 'pc_lng' columns
    features_clustering = features_df.drop(['pincode', 'pc_lat', 'pc_lng'], axis=1)
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(features_clustering)  #n_init set explicitly; its default differs between scikit-learn versions
    
    print(kmeans.labels_)
    #(2) Add cluster column to features_df
    cluster_df = pd.DataFrame({
        'pincode': features_df.loc[:, 'pincode'],
        'pc_lat': features_df.loc[:, 'pc_lat'],
        'pc_lng': features_df.loc[:, 'pc_lng'],
        'cluster': kmeans.labels_
    })
    
    return cluster_df

k = 5
cluster_df0 = get_clusters(feature_df0, k)
display(cluster_df0.head())
#display(cluster_df.tail())

[0 4 1 1 1 1 0 1 0 0 1 0 1 0 1 0 1 1 0 0 1 0 1 0 1 0 0 3 0 1 1 0 1 0 1 0 0
 1 0 0 0 0 2]


|  | pincode | pc_lat | pc_lng | cluster |
|---|---|---|---|---|
| 0 | 600001.0 | 13.093 | 80.2882 | 0 |
| 1 | 600002.0 | 13.0744 | 80.2714 | 4 |
| 2 | 600003.0 | 13.0819 | 80.2781 | 1 |
| 3 | 600004.0 | 13.0292 | 80.2708 | 1 |
| 4 | 600005.0 | 13.0572 | 80.2778 | 1 |


In [16]:
#Displaying map of CITY0 with ALL CLUSTERS marked
lat = CITIES[0]['lat']
lng = CITIES[0]['lng']
cluster_df = cluster_df0

map1 = folium.Map(location=[lat, lng], zoom_start=11)

for pc, lat, lon, cl in zip(cluster_df['pincode'], cluster_df['pc_lat'], cluster_df['pc_lng'], cluster_df['cluster']):
    text = '{}, {}'.format(pc, cl)
    label = folium.Popup(text, parse_html=True)
    colour = ['red', 'blue', 'green', 'yellow', 'black']
    
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=colour[cl],
        fill=True,
        fill_color=colour[cl],
        fill_opacity=0.7).add_to(map1)
    
map1
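The clustering step in this section can be tried on toy data. The sketch below is an illustration under stated assumptions, not the notebook's actual run: the six feature rows are invented category-share vectors, `k` is set to 2, and the modulo lookup into the palette is an added safeguard so that a `k` larger than the five listed colours does not raise an `IndexError`.

```python
import numpy as np
from sklearn.cluster import KMeans

# Invented "category share" vectors: three rows dominated by the first
# category and three rows dominated by the last
X = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.1, 0.9],
    [0.0, 0.2, 0.8],
    [0.0, 0.0, 1.0],
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_

# Cluster ids are arbitrary; only the grouping is meaningful
print(labels)

# Wrap around the palette so any k maps to a colour
palette = ['red', 'blue', 'green', 'yellow', 'black']
colours = [palette[label % len(palette)] for label in labels]
print(colours)
```

Since the numeric ids carry no meaning of their own, a colour describes a cluster only within a single fitted model.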

### 3. Classification based on found clusters (CITY 2)

In [17]:
#Classify
X_train = feature_df0.drop(columns=['pincode', 'pc_lat', 'pc_lng'])
Y_train = cluster_df0['cluster']

knn_clf = KNeighborsClassifier(n_neighbors=5, weights='distance')
knn_clf.fit(X_train, Y_train)

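#Accuracy on the training data: with weights='distance' every training point is its own
#zero-distance neighbour, so 1.0 is expected here and says nothing about generalisation
print(knn_clf.score(X_train, Y_train))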

X_test = feature_df1.drop(columns=['pincode', 'pc_lat', 'pc_lng'])
Y_test = knn_clf.predict(X_test)
print(Y_test)
cluster_df1 = pd.DataFrame({
        'pincode': feature_df1.loc[:, 'pincode'],
        'pc_lat': feature_df1.loc[:, 'pc_lat'],
        'pc_lng': feature_df1.loc[:, 'pc_lng'],
        'cluster': Y_test
    })
cluster_df1.head(10)

1.0
[1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 1]


|  | pincode | pc_lat | pc_lng | cluster |
|---|---|---|---|---|
| 0 | 700001.0 | 22.5947 | 88.3645 | 1 |
| 1 | 700007.0 | 22.5667 | 88.35 | 1 |
| 2 | 700015.0 | 22.55 | 88.3833 | 0 |
| 3 | 700019.0 | 22.529 | 88.368 | 0 |
| 4 | 700022.0 | 22.55 | 88.3333 | 0 |
| 5 | 700027.0 | 22.532 | 88.3232 | 0 |
| 6 | 700031.0 | 22.5543 | 88.3132 | 1 |
| 7 | 700033.0 | 22.5041 | 88.3598 | 0 |
| 8 | 700040.0 | 22.5333 | 88.3917 | 0 |
| 9 | 700047.0 | 22.4667 | 88.3833 | 0 |


In [18]:
#Displaying map of CITY1 with ALL CLUSTERS marked
lat = CITIES[1]['lat']
lng = CITIES[1]['lng']
cluster_df = cluster_df1

map1 = folium.Map(location=[lat, lng], zoom_start=11)

for pc, lat, lon, cl in zip(cluster_df['pincode'], cluster_df['pc_lat'], cluster_df['pc_lng'], cluster_df['cluster']):
    text = '{}, {}'.format(pc, cl)
    label = folium.Popup(text, parse_html=True)
    colour = ['red', 'blue', 'green', 'yellow', 'black']
    
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=colour[cl],
        fill=True,
        fill_color=colour[cl],
        fill_opacity=0.7).add_to(map1)
    
map1
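The classification step transfers cluster labels from one city to another with a distance-weighted k-nearest-neighbours vote. The sketch below reproduces that idea on invented two-dimensional data; the training vectors, their labels and the two query points are all assumptions for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Invented training data: feature vectors with cluster labels already assigned
X_train = np.array([[0.9, 0.1], [0.8, 0.2], [1.0, 0.0],
                    [0.1, 0.9], [0.2, 0.8], [0.0, 1.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])

clf = KNeighborsClassifier(n_neighbors=5, weights='distance').fit(X_train, y_train)

# Every training point is its own nearest neighbour at distance 0, so with
# distance weighting it always recovers its own label
train_score = clf.score(X_train, y_train)

# Unseen points take the label favoured by the closest training points
pred = clf.predict(np.array([[0.7, 0.3], [0.3, 0.7]]))
print(train_score, pred)
```

The classifier can only return labels present in its training data, so every location in the second city is assigned to one of the first city's clusters even when none is a close match.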

## 4. Conclusion

The results show that the workflow is functional and can detect similar locations in two different cities. From the above we see that certain locations in Chennai do not have suitable counterparts in Kolkata: clusters 2, 3 and 4 (green, yellow and black) each contain a single Chennai pincode and no Kolkata pincode. In the outputs shown above, cluster 0 (red), which is mostly associated with urban neighbourhoods, is the most common type of location in both cities (22 of 43 pincodes in Chennai and 12 of 17 in Kolkata), with cluster 1 (blue) second (18 and 5 pincodes respectively). More in-depth analysis is required to determine what distinguishes each cluster from the others, but that is outside the scope of this report, which aims only at a preliminary similarity assessment of locations in two cities.
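The cluster sizes behind these statements can be tallied directly from the label arrays printed by the clustering and classification cells. The sketch below copies those two arrays by hand and counts them with `collections.Counter`; the variable names are illustrative, and cluster indices map to colours through the `colour` list in the plotting cells (index 0 is red, index 1 is blue).

```python
from collections import Counter

# Label arrays copied from the printed outputs of the clustering (Chennai)
# and classification (Kolkata) cells
chennai_labels = [0, 4, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0,
                  1, 0, 1, 0, 1, 0, 0, 3, 0, 1, 1, 0, 1, 0, 1, 0, 0,
                  1, 0, 0, 0, 0, 2]
kolkata_labels = [1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

chennai_counts = Counter(chennai_labels)
kolkata_counts = Counter(kolkata_labels)

# Clusters found in Chennai to which no Kolkata pincode was assigned
unmatched = sorted(set(chennai_counts) - set(kolkata_counts))

print(chennai_counts.most_common())
print(kolkata_counts.most_common())
print(unmatched)
```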