### Step 2: Data Understanding.

We will use the K-means clustering algorithm to segment and cluster neighbourhoods based on the nearby venues which we will obtain from the Foursquare API.

To use the Foursquare API we need a dataset listing the neighbourhoods along with their latitude and longitude values. However, as the neighbourhoods in the city of Kolkata cover quite large areas, we will instead use wards (the smallest unit areas designated by the Kolkata Municipal Corporation for ease of governance).

The wards are governed by local councillors elected by the public in the municipal elections. There are 144 wards in Kolkata, maintained by the Kolkata Municipal Corporation (KMC), and each ward is designated by a number. The KMC website lists information about each ward, such as the councillor's name and office address. We will need the office address to work out latitude and longitude values for the wards.

### Step 3: Data Collection.

It was difficult to scrape the data from the KMC website because the dataset was split across 15 pages. I searched on Google and found that someone had already prepared the dataset I was looking for. Here's the link to the dataset:

http://dev.opencity.in/dataset/50dad059-13b7-499d-aa61-9bf78fef7267/resource/dfdbde68-68e2-4c63-ac2d-faea1deb998a/download/kolkota-kmc-councillors-2018-1.csv

Paste this link into a new browser tab to download the file.

Let us first install folium and import all the libraries we need.

In [1]:
pip install folium

Collecting folium
  Downloading folium-0.12.1-py2.py3-none-any.whl (94 kB)
     |████████████████████████████████| 94 kB 6.2 MB/s  eta 0:00:01
Collecting branca>=0.3.0
  Downloading branca-0.4.2-py3-none-any.whl (24 kB)
Installing collected packages: branca, folium
Successfully installed branca-0.4.2 folium-0.12.1
Note: you may need to restart the kernel to use updated packages.


In [2]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle HTTP requests
from pandas import json_normalize # transform JSON into a pandas dataframe (pandas >= 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

import folium # map rendering library

# read downloaded text (e.g. from Google Drive) as a file-like object
from io import StringIO
 
print('Libraries imported.')

Libraries imported.


Now we will install OpenCage, a geocoding API with a free tier. We will use forward geocoding, which converts an address into latitude and longitude values.

In [3]:
pip install opencage

Collecting opencage
  Downloading opencage-1.2.2-py3-none-any.whl (6.1 kB)
Collecting backoff>=1.10.0
  Downloading backoff-1.10.0-py2.py3-none-any.whl (31 kB)
Installing collected packages: backoff, opencage
Successfully installed backoff-1.10.0 opencage-1.2.2
Note: you may need to restart the kernel to use updated packages.


In [4]:
from opencage.geocoder import OpenCageGeocode

In [5]:
# The code was removed by Watson Studio for sharing.

Looking at the office addresses in the KMC dataset, I saw that some of them were not written cleanly enough to be fed into the OpenCage API. So I cleaned the dataset and uploaded it to my GitHub.
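
The cleaning steps themselves are not shown here. As a rough, hypothetical sketch (not the cleaning actually applied to this dataset), tidying an address might mean fixing the spacing around commas, collapsing stray whitespace, and expanding short postal suffixes. The function name `normalise_address` and the assumption that `Kolkata-NN` abbreviates the PIN code `7000NN` are illustrative only.

```python
import re

def normalise_address(raw):
    """Hypothetical clean-up of a free-text Kolkata address before geocoding."""
    text = re.sub(r'\s*,\s*', ', ', raw)          # exactly one space after each comma
    text = re.sub(r'\s+', ' ', text).strip()      # collapse runs of whitespace
    # Assumption: a short suffix such as 'Kolkata-37' stands for PIN code 700037.
    text = re.sub(
        r'Kolkata-(\d{1,3})\b',
        lambda m: 'Kolkata ' + str(700000 + int(m.group(1))),
        text,
    )
    # Full six-digit codes such as 'Kolkata-700002' only lose the hyphen.
    text = re.sub(r'Kolkata-(\d{6})\b', r'Kolkata \1', text)
    return text

print(normalise_address('3/1B ,Turner Road, Kolkata-02'))
```

Whether the expanded form geocodes better than the original would need to be checked against the API's responses.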

In [6]:
url = 'https://raw.githubusercontent.com/dipayan10/Kolkata_dataset/main/df_kolkata_final.csv'

Now we read the CSV file from that GitHub URL into a dataframe named df_kolkata.

In [7]:
df_kolkata = pd.read_csv(url)
df_kolkata.head()

Unnamed: 0,Ward No.,Councillors,Office Address
0,1,SMT. SITA JAISWARA,"9A, Gopi Mondal Lane, Kolkata-700002"
1,2,SMT. PUSPALI SINHA,"10, Rani Debendra Bala Rd Rishi, Paikpara Kolk..."
2,3,DR SANTANU SEN,"Seth Lane, Kolkata-50"
3,4,SHRI GOUTAM HALDAR,"9A, Raja Manindra Road, Kolkata-37"
4,5,SHRI TARUN SAHA,"67B, Khelat Babu Lane, Kolkata-37"


Next we look at the shape of the dataset and the data types of its columns.

In [8]:
df_kolkata.shape

(144, 3)

In [9]:
df_kolkata.dtypes

Ward No.           int64
Councillors       object
Office Address    object
dtype: object

Now we make a trial call to the OpenCage API with the first office address to see if it works.

In [10]:
query = df_kolkata['Office Address'][0]
query

'9A, Gopi Mondal Lane, Kolkata-700002'

In [11]:
api_url = f'https://api.opencagedata.com/geocode/v1/json?q={query}&key=c0ca14c914034184a6792d41901f6a1e&language=en&pretty=1'
data = requests.get(api_url).json()
print(data['results'])



[{'annotations': {'DMS': {'lat': "22° 33' 45.46800'' N", 'lng': "88° 21' 46.94400'' E"}, 'MGRS': '45QXE4014195744', 'Maidenhead': 'NL42en35na', 'Mercator': {'x': 9836528.618, 'y': 2562822.943}, 'OSM': {'note_url': 'https://www.openstreetmap.org/note/new#map=16/22.56263/88.36304&layers=N', 'url': 'https://www.openstreetmap.org/?mlat=22.56263&mlon=88.36304#map=16/22.56263/88.36304'}, 'UN_M49': {'regions': {'ASIA': '142', 'IN': '356', 'SOUTHERN_ASIA': '034', 'WORLD': '001'}, 'statistical_groupings': ['LEDC']}, 'callingcode': 91, 'currency': {'alternate_symbols': ['Rs', '৳', '૱', '௹', 'रु', '₨'], 'decimal_mark': '.', 'html_entity': '&#x20b9;', 'iso_code': 'INR', 'iso_numeric': '356', 'name': 'Indian Rupee', 'smallest_denomination': 50, 'subunit': 'Paisa', 'subunit_to_unit': 100, 'symbol': '₹', 'symbol_first': 1, 'thousands_separator': ','}, 'flag': '🇮🇳', 'geohash': 'tunb6g2hb544sbp4kre4', 'qibla': 278.22, 'roadinfo': {'drive_on': 'left', 'speed_in': 'km/h'}, 'sun': {'rise': {'apparent': 16

As we can see, the 'results' key of the JSON response contains all the information. Next we separate the latitude and longitude values from the rest of the JSON.

In [12]:
demo_Latitude = data['results'][0]['geometry']['lat']
demo_Longitude = data['results'][0]['geometry']['lng']
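
The lookup `data['results'][0]` raises an `IndexError` whenever the API finds no match for an address. A minimal sketch of a more defensive extraction is shown below; the helper name `extract_lat_lng` is hypothetical, and the sample dictionary is trimmed to just the fields used in this notebook.

```python
def extract_lat_lng(response):
    """Return (lat, lng) from the first result of an OpenCage-style response, or None."""
    results = response.get('results') or []
    if not results:
        return None                      # nothing matched the query
    geometry = results[0].get('geometry', {})
    if 'lat' not in geometry or 'lng' not in geometry:
        return None                      # result present but without coordinates
    return geometry['lat'], geometry['lng']

# Response trimmed to the fields read above
sample = {'results': [{'geometry': {'lat': 22.56263, 'lng': 88.36304}}]}
print(extract_lat_lng(sample))
print(extract_lat_lng({'results': []}))
```

Returning `None` instead of raising lets the caller decide whether to skip the address or retry it with a cleaned-up query.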

Now we create a new, empty dataframe called latlong_dataframe with the columns listed below, and add one row to it holding the trial address and its coordinates.

In [13]:
my_columns = ['Office Address', 'Latitude', 'Longitude']
latlong_dataframe = pd.DataFrame(columns = my_columns)

In [14]:
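# DataFrame.append was removed in pandas 2.0, so add the trial row with pd.concat
demo_row = pd.DataFrame(
    [[df_kolkata['Office Address'][0], demo_Latitude, demo_Longitude]],
    columns=my_columns
)
pd.concat([latlong_dataframe, demo_row], ignore_index=True)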

Unnamed: 0,Office Address,Latitude,Longitude
0,"9A, Gopi Mondal Lane, Kolkata-700002",22.56263,88.36304


Now it's time to populate the dataframe using the rest of the office addresses. We will loop over the addresses, call the OpenCage API for each one, separate out the latitude and longitude values, and store them in latlong_dataframe.

In [15]:
rows = []
for address in df_kolkata['Office Address']:
    api_url = f'https://api.opencagedata.com/geocode/v1/json?q={address}&key=c0ca14c914034184a6792d41901f6a1e&language=en&pretty=1'
    data = requests.get(api_url).json()
    # keep only the coordinates of the first (best) match
    geometry = data['results'][0]['geometry']
    rows.append([address, geometry['lat'], geometry['lng']])

# DataFrame.append was removed in pandas 2.0, so build the dataframe in one step
latlong_dataframe = pd.DataFrame(rows, columns=my_columns)
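
The loop above issues one request per row and stops with an exception on the first address that returns no result. As a sketch under stated assumptions, the variant below looks each distinct address up only once and records `None` for failures. The function `geocode_addresses` and its `geocode` parameter (any callable returning `(lat, lng)` or `None`) are hypothetical names, and a dictionary stands in for the real API so the example runs offline.

```python
def geocode_addresses(addresses, geocode):
    """Geocode each address once; `geocode` returns (lat, lng) or None."""
    cache = {}
    rows = []
    for address in addresses:
        if address not in cache:          # avoid repeat API calls for repeated addresses
            cache[address] = geocode(address)
        coords = cache[address]
        lat, lng = coords if coords is not None else (None, None)
        rows.append({'Office Address': address, 'Latitude': lat, 'Longitude': lng})
    return rows

# Stand-in geocoder so the sketch runs without network access or an API key
fake_lookup = {'Seth Lane, Kolkata-50': (22.586374, 88.353731)}
rows = geocode_addresses(
    ['Seth Lane, Kolkata-50', 'Unknown Road', 'Seth Lane, Kolkata-50'],
    fake_lookup.get,
)
for row in rows:
    print(row)
```

The returned list of dictionaries can be passed straight to `pd.DataFrame(rows)`, and rows with missing coordinates can then be inspected before moving on.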

In [16]:
latlong_dataframe.head()

Unnamed: 0,Office Address,Latitude,Longitude
0,"9A, Gopi Mondal Lane, Kolkata-700002",22.56263,88.36304
1,"10, Rani Debendra Bala Rd Rishi, Paikpara Kolk...",22.56263,88.36304
2,"Seth Lane, Kolkata-50",22.586374,88.353731
3,"9A, Raja Manindra Road, Kolkata-37",22.612521,88.383454
4,"67B, Khelat Babu Lane, Kolkata-37",22.56263,88.36304
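
In the output above, rows 0, 1 and 4 have different addresses but identical coordinates (22.56263, 88.36304). When distinct addresses resolve to exactly the same point, the geocoder may have fallen back to a broader match such as the city itself; that explanation is an assumption and has not been verified here. A minimal sketch for flagging such cases follows, using a hypothetical helper `shared_coordinates`.

```python
from collections import Counter

def shared_coordinates(rows):
    """Return {(lat, lng): count} for points assigned to more than one distinct address."""
    unique = {(addr, lat, lng) for addr, lat, lng in rows}   # ignore repeated addresses
    counts = Counter((lat, lng) for _, lat, lng in unique)
    return {point: n for point, n in counts.items() if n > 1}

# The five rows printed above (the second address is truncated in the output)
rows = [
    ('9A, Gopi Mondal Lane, Kolkata-700002', 22.56263, 88.36304),
    ('10, Rani Debendra Bala Rd Rishi, Paikpara Kolk...', 22.56263, 88.36304),
    ('Seth Lane, Kolkata-50', 22.586374, 88.353731),
    ('9A, Raja Manindra Road, Kolkata-37', 22.612521, 88.383454),
    ('67B, Khelat Babu Lane, Kolkata-37', 22.56263, 88.36304),
]
print(shared_coordinates(rows))   # {(22.56263, 88.36304): 3}
```

On the dataframe itself, `latlong_dataframe.duplicated(subset=['Latitude', 'Longitude'], keep=False)` marks the same rows. Wards that share a point will also receive identical venue lists later on.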


And finally we merge the two dataframes into one on the column "Office Address".

In [17]:
df_kolkata = pd.merge(df_kolkata, latlong_dataframe, on="Office Address")
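
One thing to watch with a merge on a text column: if the same address appears more than once in both dataframes, every matching pair is kept and the row count grows. Whether that happens in this dataset is not shown; the toy frames below are invented purely to illustrate the effect and one way to avoid it.

```python
import pandas as pd

wards = pd.DataFrame({
    'Ward No.': [1, 2, 3],
    'Office Address': ['A Street', 'B Road', 'B Road'],   # two wards share an address
})
coords = pd.DataFrame({
    'Office Address': ['A Street', 'B Road', 'B Road'],
    'Latitude': [22.60, 22.58, 22.58],
    'Longitude': [88.37, 88.35, 88.35],
})

merged = pd.merge(wards, coords, on='Office Address')
print(len(wards), len(merged))   # 3 rows in, 5 rows out

# Dropping duplicate keys on the right-hand side keeps one row per ward
deduped = pd.merge(wards, coords.drop_duplicates('Office Address'), on='Office Address')
print(len(deduped))
```

Comparing `len()` before and after the merge is a quick way to confirm that no rows were duplicated.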

In [18]:
df_kolkata.head(20)

Unnamed: 0,Ward No.,Councillors,Office Address,Latitude,Longitude
0,1,SMT. SITA JAISWARA,"9A, Gopi Mondal Lane, Kolkata-700002",22.56263,88.36304
1,2,SMT. PUSPALI SINHA,"10, Rani Debendra Bala Rd Rishi, Paikpara Kolk...",22.56263,88.36304
2,3,DR SANTANU SEN,"Seth Lane, Kolkata-50",22.586374,88.353731
3,4,SHRI GOUTAM HALDAR,"9A, Raja Manindra Road, Kolkata-37",22.612521,88.383454
4,5,SHRI TARUN SAHA,"67B, Khelat Babu Lane, Kolkata-37",22.56263,88.36304
5,6,SMT. SUMAN SINGH,"3/1B ,Turner Road, Kolkata-02",22.56263,88.36304
6,7,SHRI BAPI GHOSH,"47A, Bagbazar Street, Kolkata-700003",22.603919,88.366481
7,8,SHRI PARTHA MITRA,"3/1E Raja Raj Ballav Street, Kolkata-03",22.56263,88.36304
8,9,SMT. MITALI SAHA,"13A Madan Mohantala Street, Kolkata-05",22.56263,88.36304
9,10,SMT. KARUNA SENGUPTA,"Raja Nabakrishna Street, Kolkata-05",22.596719,88.366462


In [19]:
df_kolkata.drop(index=12, inplace = True)

After dropping the row at index 12 (ward 13), our updated dataframe contains the ward numbers, councillors, office addresses, latitudes and longitudes.

Our next task is to collect nearby venue data from the Foursquare API. Before making any GET requests, we obtain the latitude and longitude of the centre point of Kolkata, which we take to be Park Street. We will use the GeoPy library for this task.

In [20]:
address = 'Park Street, Kolkata, India'

geolocator = Nominatim(user_agent="kolkata_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Kolkata are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Kolkata are 22.5551591, 88.3501171.


Now we set the Foursquare API credentials. As this information is sensitive, I am going to hide this cell.

In [21]:
# The code was removed by Watson Studio for sharing.

To visualise the data we have collected so far, we superimpose the wards on a folium map.

In [22]:
# create map of Kolkata using latitude and longitude values
map_kolkata = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, ward in zip(df_kolkata['Latitude'], df_kolkata['Longitude'], df_kolkata['Ward No.']):
    label = '{}'.format(ward)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_kolkata)  
    
map_kolkata

Now we write a function called 'getNearbyVenues'. It takes the location values from our df_kolkata dataset, makes an API call for each ward, receives a JSON response listing the nearby venues, parses that JSON for the venue categories and venue location data, and then collects all the data in a pandas dataframe.

In [23]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    # flatten the per-ward lists into one dataframe with named columns
    nearby_venues = pd.DataFrame(
        [item for venue_list in venues_list for item in venue_list],
        columns=['Ward', 
                 'Ward Latitude', 
                 'Ward Longitude', 
                 'Venue', 
                 'Venue Latitude', 
                 'Venue Longitude', 
                 'Venue Category'])
    
    return nearby_venues
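
The list comprehension in `getNearbyVenues` assumes every venue has at least one category; a venue with an empty `categories` list would raise an `IndexError`. The sketch below shows one way to tolerate that case. The helper `parse_venues` is hypothetical, and the sample items are hand-written to match only the fields the function reads, not real API output.

```python
def parse_venues(ward, ward_lat, ward_lng, items):
    """Flatten explore-style items into rows, tolerating venues with no category."""
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        category = categories[0]['name'] if categories else None
        rows.append((
            ward, ward_lat, ward_lng,
            venue['name'],
            venue['location']['lat'],
            venue['location']['lng'],
            category,
        ))
    return rows

# Hand-written items shaped like the fields getNearbyVenues reads
items = [
    {'venue': {'name': 'Hind INOX',
               'location': {'lat': 22.564914, 'lng': 88.359152},
               'categories': [{'name': 'Multiplex'}]}},
    {'venue': {'name': 'Unlabelled Place',
               'location': {'lat': 22.56, 'lng': 88.36},
               'categories': []}},
]
rows = parse_venues(1, 22.56263, 88.36304, items)
for row in rows:
    print(row)
```

Venues recorded with a category of `None` can be dropped or given a placeholder label before any analysis by category.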

Now we call the function and create a dataframe called kolkata_venues.

In [24]:
# collect the nearby venues for every ward
kolkata_venues = getNearbyVenues(names=df_kolkata['Ward No.'],
                                   latitudes=df_kolkata['Latitude'],
                                   longitudes=df_kolkata['Longitude']
                                  )

1
2
3
4
5
6
7
8
9
10
11
12
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144


We have now reached the end of our data collection stage. In the final dataframe we have ward numbers, ward location data, venues, venue categories, and venue location data. We have all the features we need to perform exploratory data analysis and feed the data into our machine learning model.

In [25]:
kolkata_venues.shape

(871, 7)

In [26]:
kolkata_venues.head(10)

Unnamed: 0,Ward,Ward Latitude,Ward Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,1,22.56263,88.36304,Prowess Pursuit Pvt Ltd,22.564407,88.363346,Business Service
1,1,22.56263,88.36304,Raja Subodh Mallick Square,22.56361,88.359668,Park
2,1,22.56263,88.36304,Georgian Inn,22.559555,88.361805,Hotel
3,1,22.56263,88.36304,Hind INOX,22.564914,88.359152,Multiplex
4,1,22.56263,88.36304,Santosh Mitra Square,22.566008,88.365719,Park
5,1,22.56263,88.36304,Entally Market,22.559952,88.366698,Market
6,1,22.56263,88.36304,Hindmahal,22.560496,88.367319,Indian Restaurant
7,1,22.56263,88.36304,Axis Bank ATM,22.56232,88.362133,ATM
8,2,22.56263,88.36304,Prowess Pursuit Pvt Ltd,22.564407,88.363346,Business Service
9,2,22.56263,88.36304,Raja Subodh Mallick Square,22.56361,88.359668,Park
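
Before K-means can be applied, each ward has to be described by numbers rather than venue names. One common option, assumed here rather than taken from this section, is the share of each venue category within a ward. The sketch below computes those shares for the ten rows shown above; `category_shares` is a hypothetical helper.

```python
from collections import Counter, defaultdict

def category_shares(rows):
    """Map each ward to {category: fraction of that ward's venues}."""
    per_ward = defaultdict(Counter)
    for ward, category in rows:
        per_ward[ward][category] += 1
    return {
        ward: {cat: n / sum(counts.values()) for cat, n in counts.items()}
        for ward, counts in per_ward.items()
    }

# (Ward, Venue Category) pairs taken from the ten rows shown above
rows = [
    (1, 'Business Service'), (1, 'Park'), (1, 'Hotel'), (1, 'Multiplex'),
    (1, 'Park'), (1, 'Market'), (1, 'Indian Restaurant'), (1, 'ATM'),
    (2, 'Business Service'), (2, 'Park'),
]
shares = category_shares(rows)
print(shares[1]['Park'])   # 2 of ward 1's 8 listed venues are parks -> 0.25
```

Ward 2 is only partly visible in the ten-row preview, so its shares here are not meaningful; the point is the shape of the resulting table, with one row per ward and one value per category.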
