# Introduction to the IBM Applied Data Science Capstone Project

This notebook introduces a business problem to be solved with data science and machine learning, and walks through how the data is collected, cleaned and prepared before being fed into the machine learning model.

### Step 1: Introduction / Business understanding

I have chosen this data science project as an opportunity to work on a business idea that I have had in mind since college. My dream is to set up a chain of artificial turfs for various sports and activities such as football, cricket and fitness workshops. This has huge upside potential in India, especially in Kolkata, my hometown, because:

*   The parks and fields available to the public are not well maintained and are poorly suited to a seamless playing experience.

*   The parks and playing fields are not always available when working people need them, due to a lack of infrastructure (proper lighting at night, nets, designated playing areas, level playing fields, etc.).

*   Play is often interrupted by religious ceremonies, fairs, political gatherings, etc.

*   Artificial turfs equipped with sheds are a bonus, as they can also be used in bad weather (rainy days, hot summer days, etc.).


If we follow a business model that offers our target audience a designated playing time with decent infrastructure, people are likely to pay the fees, as they are becoming more health conscious and trying hard to stay fit and active in their fast-paced lives.

One of the most important decisions to make before starting this business is where to set up the turfs so that they reach their maximum potential. In this data science project I have applied what I learnt in the 9 courses and built a machine learning model using the K-means algorithm to find the optimum locations for the chain of artificial turfs.

### Step 2: Business problem

To determine the locations or neighbourhoods where setting up turfs would be most beneficial, we will use an unsupervised learning algorithm, k-means clustering. Using Foursquare data, we will gather the nearby venues of each neighbourhood or ward. Through exploratory data analysis we can determine the relative frequency of the most common venue categories in each ward.

Then we will feed the dataset into the k-means clustering algorithm to cluster the wards by their similarity to each other. Our goal is to find out which clusters are most suitable for setting up artificial turfs.

*   First, we will discard the cluster with the most parks and playgrounds. People living near playgrounds will think twice about paying for artificial turfs when they can play for free.

*   Secondly, we will look for wards that are easy to reach. Wards with more bus stops and train stations will be beneficial for us.

*   We will also look for schools, colleges and commercial complexes, because the age groups most interested in sports are children, young adults and adults.
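
The selection criteria above can be sketched in a few lines. This is only an illustration: the `cluster_profiles` numbers are invented, and the helper `share` and the category groupings `FREE_PLAY` and `TRANSIT` are assumptions of this sketch, not part of the analysis that follows.

```python
# Hypothetical mean venue frequencies per cluster (toy numbers, not real results).
cluster_profiles = {
    0: {"Park": 0.30, "Playground": 0.10, "Bus Station": 0.05, "Train Station": 0.00},
    1: {"Park": 0.02, "Playground": 0.00, "Bus Station": 0.15, "Train Station": 0.10},
    2: {"Park": 0.05, "Playground": 0.02, "Bus Station": 0.02, "Train Station": 0.01},
}

# Assumed groupings of Foursquare venue categories.
FREE_PLAY = ("Park", "Playground")
TRANSIT = ("Bus Station", "Metro Station", "Train Station", "Tram Station")


def share(profile, categories):
    """Sum the frequencies of the given categories in one cluster profile."""
    return sum(profile.get(category, 0.0) for category in categories)


# Step 1: discard the cluster with the largest share of free playing areas.
most_parks = max(cluster_profiles, key=lambda c: share(cluster_profiles[c], FREE_PLAY))
candidates = {c: p for c, p in cluster_profiles.items() if c != most_parks}

# Step 2: rank the remaining clusters by their share of transit venues.
ranked = sorted(candidates, key=lambda c: share(candidates[c], TRANSIT), reverse=True)

print("discarded cluster:", most_parks)
print("remaining clusters, best connected first:", ranked)
```

In practice the profiles would come from averaging the venue frequencies of the wards in each cluster, and schools, colleges and commercial complexes could be scored the same way.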











### Step 2: Data Understanding.

We will use the K-means clustering algorithm to segment and cluster neighbourhoods based on the nearby venues which we will obtain from the Foursquare API.

To use the Foursquare API we need a dataset containing the list of neighbourhoods along with their latitude and longitude values. But as the neighbourhoods in the city of Kolkata are quite large in area, we will instead use wards (the smallest unit areas designated by the Kolkata Municipal Corporation for ease of governance).

The wards are governed by local councillors elected by the public in the municipal elections. There are 144 wards in Kolkata, maintained by the Kolkata Municipal Corporation (KMC). The wards are designated by numbers. The KMC website lists information about each ward, such as the councillor's name and office address. We will need the office address to work out latitude and longitude values for each ward.

### Step 3: Data Collection.

It was difficult to scrape the data from the KMC website because the dataset was divided into 15 pages. I searched on Google and found that someone had already prepared the dataset I was looking for. Here's the link to the dataset:

http://dev.opencity.in/dataset/50dad059-13b7-499d-aa61-9bf78fef7267/resource/dfdbde68-68e2-4c63-ac2d-faea1deb998a/download/kolkota-kmc-councillors-2018-1.csv

Copy and paste this link into a new browser tab.

Let us first import all the libraries.

In [3]:
pip install folium



In [4]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle HTTP requests
from pandas import json_normalize # transform JSON into a pandas dataframe (available at the top level since pandas 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors
import matplotlib.pyplot as plt

# import k-means for the clustering stage
from sklearn.cluster import KMeans

import folium # map rendering library

# in-memory text buffer, useful for reading downloaded CSV text
from io import StringIO

print('Libraries imported.')

Libraries imported.


Now we will install the client for OpenCage, a free geocoding API. We will use forward geocoding, which converts an address into latitude and longitude values.

In [5]:
pip install opencage


In [6]:
from opencage.geocoder import OpenCageGeocode

In [7]:
# The code was removed by Watson Studio for sharing.

Looking at the office addresses in the KMC dataset, I saw that some of them were not written cleanly enough to be fed into the OpenCage API. So I cleaned the dataset and uploaded it to my GitHub.

In [8]:
url = 'https://raw.githubusercontent.com/dipayan10/Kolkata_dataset/main/df_kolkata_final.csv'

Now we read the CSV file from the url of my GitHub and name it df_kolkata.

In [9]:
df_kolkata = pd.read_csv(url)
df_kolkata.head()

Unnamed: 0,Ward No.,Councillors,Office Address
0,1,SMT. SITA JAISWARA,"9A, Gopi Mondal Lane, Kolkata-700002"
1,2,SMT. PUSPALI SINHA,"10, Rani Debendra Bala Rd Rishi, Paikpara Kolk..."
2,3,DR SANTANU SEN,"Seth Lane, Kolkata-50"
3,4,SHRI GOUTAM HALDAR,"9A, Raja Manindra Road, Kolkata-37"
4,5,SHRI TARUN SAHA,"67B, Khelat Babu Lane, Kolkata-37"


Now we try to learn more about the dataset, shape of the dataset and the data types.

In [10]:
df_kolkata.shape

(144, 3)

In [11]:
df_kolkata.dtypes

Ward No.           int64
Councillors       object
Office Address    object
dtype: object

Now we run a trial of the OpenCage API to see if it works.

In [12]:
query = df_kolkata['Office Address'][0]
query

'9A, Gopi Mondal Lane, Kolkata-700002'

In [13]:
api_url = f'https://api.opencagedata.com/geocode/v1/json?q={query}&key=c0ca14c914034184a6792d41901f6a1e&language=en&pretty=1'
data = requests.get(api_url).json()
print(data['results'])



[{'annotations': {'DMS': {'lat': "22° 33' 45.46800'' N", 'lng': "88° 21' 46.94400'' E"}, 'MGRS': '45QXE4014195744', 'Maidenhead': 'NL42en35na', 'Mercator': {'x': 9836528.618, 'y': 2562822.943}, 'OSM': {'note_url': 'https://www.openstreetmap.org/note/new#map=17/22.56263/88.36304&layers=N', 'url': 'https://www.openstreetmap.org/?mlat=22.56263&mlon=88.36304#map=17/22.56263/88.36304'}, 'UN_M49': {'regions': {'ASIA': '142', 'IN': '356', 'SOUTHERN_ASIA': '034', 'WORLD': '001'}, 'statistical_groupings': ['LEDC']}, 'callingcode': 91, 'currency': {'alternate_symbols': ['Rs', '৳', '૱', '௹', 'रु', '₨'], 'decimal_mark': '.', 'html_entity': '&#x20b9;', 'iso_code': 'INR', 'iso_numeric': '356', 'name': 'Indian Rupee', 'smallest_denomination': 50, 'subunit': 'Paisa', 'subunit_to_unit': 100, 'symbol': '₹', 'symbol_first': 1, 'thousands_separator': ','}, 'flag': '🇮🇳', 'geohash': 'tunb6g2hb544sbp4kre4', 'qibla': 278.22, 'roadinfo': {'drive_on': 'left', 'speed_in': 'km/h'}, 'sun': {'rise': {'apparent': 16

As we can see, the 'results' key of the JSON contains all the information. Next we separate the latitude and longitude values from the rest of the JSON.

In [14]:
demo_Latitude = data['results'][0]['geometry']['lat']
demo_Longitude = data['results'][0]['geometry']['lng']
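
Building the request URL by string formatting works for simple addresses, but an address containing characters such as `&` or `#` would break the query string, and an address with no match may come back with an empty `results` list. The sketch below shows one way to guard against both. The helper names `build_geocode_url` and `extract_lat_lng` and the placeholder key `YOUR_KEY` are assumptions made for illustration.

```python
from urllib.parse import urlencode


def build_geocode_url(address, api_key):
    """Return a forward-geocoding URL with the address percent-encoded."""
    params = {"q": address, "key": api_key, "language": "en"}
    return "https://api.opencagedata.com/geocode/v1/json?" + urlencode(params)


def extract_lat_lng(payload):
    """Return (lat, lng) from the first result, or None when there is no result."""
    results = payload.get("results") or []
    if not results:
        return None
    geometry = results[0]["geometry"]
    return geometry["lat"], geometry["lng"]


# Hand-written payload that mimics the part of the response used above.
sample_payload = {"results": [{"geometry": {"lat": 22.56263, "lng": 88.36304}}]}

demo_url = build_geocode_url("9A, Gopi Mondal Lane, Kolkata-700002", "YOUR_KEY")
print(demo_url)
print(extract_lat_lng(sample_payload))
print(extract_lat_lng({"results": []}))
```

Returning `None` instead of raising lets a calling loop record the failed address and carry on with the rest.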

Now we create a new dataframe called latlong_dataframe and add a row to it containing the values for the following columns.

In [15]:
my_columns = ['Office Address', 'Latitude', 'Longitude']
latlong_dataframe = pd.DataFrame(columns = my_columns)

In [16]:
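# DataFrame.append was removed in pandas 2.0, so we build a one-row dataframe and use pd.concat
demo_row = pd.DataFrame(
    [[df_kolkata['Office Address'][0], demo_Latitude, demo_Longitude]],
    columns=my_columns
)
pd.concat([latlong_dataframe, demo_row], ignore_index=True)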

Unnamed: 0,Office Address,Latitude,Longitude
0,"9A, Gopi Mondal Lane, Kolkata-700002",22.56263,88.36304


Now it's time to populate the dataframe using the rest of the office addresses. We will loop over the OpenCage API call, separate out the latitude and longitude values and add them to latlong_dataframe.

In [17]:
# collect one [address, lat, lng] row per office address
rows = []
for address in df_kolkata['Office Address']:
    api_url = f'https://api.opencagedata.com/geocode/v1/json?q={address}&key=c0ca14c914034184a6792d41901f6a1e&language=en&pretty=1'
    data = requests.get(api_url).json()
    # take the first (best) match returned for this address
    geometry = data['results'][0]['geometry']
    rows.append([address, geometry['lat'], geometry['lng']])

# build the dataframe once at the end; DataFrame.append was removed in pandas 2.0
latlong_dataframe = pd.DataFrame(rows, columns=my_columns)

In [18]:
latlong_dataframe.head()

Unnamed: 0,Office Address,Latitude,Longitude
0,"9A, Gopi Mondal Lane, Kolkata-700002",22.56263,88.36304
1,"10, Rani Debendra Bala Rd Rishi, Paikpara Kolk...",22.56263,88.36304
2,"Seth Lane, Kolkata-50",22.586374,88.353731
3,"9A, Raja Manindra Road, Kolkata-37",22.612521,88.383454
4,"67B, Khelat Babu Lane, Kolkata-37",22.56263,88.36304
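
In the output above, several different addresses share the coordinates 22.56263, 88.36304. That may mean the geocoder could not resolve those addresses and fell back to a coarser match, so it is worth flagging such rows before relying on them. A minimal sketch, using the five rows shown above and a hypothetical column name `Suspect`:

```python
import pandas as pd

# The five rows printed above (the second address is truncated in the output).
sample = pd.DataFrame(
    {
        "Office Address": [
            "9A, Gopi Mondal Lane, Kolkata-700002",
            "10, Rani Debendra Bala Rd Rishi, Paikpara Kolk...",
            "Seth Lane, Kolkata-50",
            "9A, Raja Manindra Road, Kolkata-37",
            "67B, Khelat Babu Lane, Kolkata-37",
        ],
        "Latitude": [22.56263, 22.56263, 22.586374, 22.612521, 22.56263],
        "Longitude": [88.36304, 88.36304, 88.353731, 88.383454, 88.36304],
    }
)

# keep=False marks every row of a repeated (lat, lng) pair, not only the later ones.
sample["Suspect"] = sample.duplicated(subset=["Latitude", "Longitude"], keep=False)

print(sample[["Office Address", "Suspect"]])
print("rows sharing coordinates:", int(sample["Suspect"].sum()))
```

Flagged rows could then be re-geocoded with a cleaner address or checked by hand.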


And finally we merge the two dataframes into one on the column "Office Address".

In [19]:
df_kolkata = pd.merge(df_kolkata, latlong_dataframe, on="Office Address")
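
One thing to watch with this merge: if the same office address appears in more than one row, merging on "Office Address" pairs every matching row on the left with every matching row on the right, so the result gains extra rows. The toy example below (invented addresses) shows the effect and one possible guard, which is to drop repeated keys on the right and ask pandas to check the relationship with `validate`.

```python
import pandas as pd

# Two wards share one office address in this invented example.
left = pd.DataFrame(
    {"Ward No.": [1, 2, 3], "Office Address": ["A Road", "A Road", "B Lane"]}
)
right = pd.DataFrame(
    {
        "Office Address": ["A Road", "A Road", "B Lane"],
        "Latitude": [22.5, 22.5, 22.6],
        "Longitude": [88.3, 88.3, 88.4],
    }
)

# Naive merge: the repeated key yields 2 x 2 rows for "A Road".
naive = pd.merge(left, right, on="Office Address")

# Guarded merge: one row per address on the right, checked by pandas.
clean = pd.merge(
    left,
    right.drop_duplicates(subset="Office Address"),
    on="Office Address",
    validate="many_to_one",
)

print(len(left), len(naive), len(clean))
```

Comparing the row count before and after the merge is a cheap way to notice this kind of duplication.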

In [20]:
df_kolkata.head(20)

Unnamed: 0,Ward No.,Councillors,Office Address,Latitude,Longitude
0,1,SMT. SITA JAISWARA,"9A, Gopi Mondal Lane, Kolkata-700002",22.56263,88.36304
1,2,SMT. PUSPALI SINHA,"10, Rani Debendra Bala Rd Rishi, Paikpara Kolk...",22.56263,88.36304
2,3,DR SANTANU SEN,"Seth Lane, Kolkata-50",22.586374,88.353731
3,4,SHRI GOUTAM HALDAR,"9A, Raja Manindra Road, Kolkata-37",22.612521,88.383454
4,5,SHRI TARUN SAHA,"67B, Khelat Babu Lane, Kolkata-37",22.56263,88.36304
5,6,SMT. SUMAN SINGH,"3/1B ,Turner Road, Kolkata-02",22.56263,88.36304
6,7,SHRI BAPI GHOSH,"47A, Bagbazar Street, Kolkata-700003",22.603919,88.366481
7,8,SHRI PARTHA MITRA,"3/1E Raja Raj Ballav Street, Kolkata-03",22.56263,88.36304
8,9,SMT. MITALI SAHA,"13A Madan Mohantala Street, Kolkata-05",22.56263,88.36304
9,10,SMT. KARUNA SENGUPTA,"Raja Nabakrishna Street, Kolkata-05",22.596719,88.366462


In [57]:
df_kolkata.drop(index=16, inplace = True)

Now our updated dataframe contains the Ward numbers, Office addresses, latitudes and longitudes.

Our next task is to collect nearby venue data from the Foursquare API. We first obtain the latitude and longitude of the centre point of Kolkata, which we take to be Park Street, and use it later to centre the map. We will use the GeoPy library for this task.

In [22]:
address = 'Park Street, Kolkata, India'

geolocator = Nominatim(user_agent="kolkata_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Kolkata are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Kolkata are 22.5551591, 88.3501171.


Now we set the Foursquare API credentials. As this information is sensitive, I am going to hide this cell.

In [23]:
# The code was removed by Watson Studio for sharing.

To visualise the data we have collected so far, we superimpose the wards on a Folium map.

In [58]:
# create map of Kolkata using latitude and longitude values
map_kolkata = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, ward in zip(df_kolkata['Latitude'], df_kolkata['Longitude'], df_kolkata['Ward No.']):
    label = '{}'.format(ward)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_kolkata)  
    
map_kolkata

Now we write a function called 'getNearbyVenues' that takes the location values from our df_kolkata dataset, makes an API call, receives a JSON response of the nearby venues, parses the JSON for venue categories and venue location data, and collects all the data in a pandas dataframe.

In [25]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Ward', 
                  'Ward Latitude', 
                  'Ward Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
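
The list comprehension in `getNearbyVenues` assumes every venue has at least one category. If a venue ever came back with an empty `categories` list, `categories[0]` would raise an `IndexError`. A defensive variant of the parsing step might look like the sketch below; `parse_venue_items` is a hypothetical helper, and the `items` list is hand-written to mimic the response shape used above rather than real API output.

```python
def parse_venue_items(ward, ward_lat, ward_lng, items):
    """Turn the 'items' list of one response into row tuples, tolerating missing categories."""
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # Fall back to None instead of failing when no category is present.
        category = categories[0]["name"] if categories else None
        rows.append(
            (
                ward,
                ward_lat,
                ward_lng,
                venue["name"],
                venue["location"]["lat"],
                venue["location"]["lng"],
                category,
            )
        )
    return rows


# Hand-written items; the second venue is invented to show the fallback.
items = [
    {
        "venue": {
            "name": "Raja Subodh Mallick Square",
            "location": {"lat": 22.56361, "lng": 88.359668},
            "categories": [{"name": "Park"}],
        }
    },
    {
        "venue": {
            "name": "Unnamed Stall",
            "location": {"lat": 22.5640, "lng": 88.3620},
            "categories": [],
        }
    },
]

parsed = parse_venue_items(1, 22.56263, 88.36304, items)
for row in parsed:
    print(row)
```

Rows with a `None` category can be dropped or labelled before the one-hot encoding step.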

Now we call the function and create a dataframe called kolkata_venues.

In [59]:
# collect the nearby venues for every ward
kolkata_venues = getNearbyVenues(names=df_kolkata['Ward No.'],
                                   latitudes=df_kolkata['Latitude'],
                                   longitudes=df_kolkata['Longitude']
                                  )

1
2
3
4
5
6
7
8
9
10
11
12
14
15
16
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144


We have now reached the end of our data collection stage. In the final dataframe we have ward numbers, ward location data, venues, venue categories, and venue location data. We have all the features we need to perform exploratory data analysis and feed the data into our machine learning model.

In [60]:
kolkata_venues.shape

(720, 7)

In [61]:
kolkata_venues.head(10)

Unnamed: 0,Ward,Ward Latitude,Ward Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,1,22.56263,88.36304,Candid Photographer,22.563947,88.362629,Photography Studio
1,1,22.56263,88.36304,Raja Subodh Mallick Square,22.56361,88.359668,Park
2,1,22.56263,88.36304,Georgian Inn,22.559555,88.361805,Hotel
3,1,22.56263,88.36304,Hind INOX,22.564914,88.359152,Multiplex
4,1,22.56263,88.36304,Santosh Mitra Square,22.566008,88.365719,Park
5,1,22.56263,88.36304,Entally Market,22.559952,88.366698,Market
6,2,22.56263,88.36304,Candid Photographer,22.563947,88.362629,Photography Studio
7,2,22.56263,88.36304,Raja Subodh Mallick Square,22.56361,88.359668,Park
8,2,22.56263,88.36304,Georgian Inn,22.559555,88.361805,Hotel
9,2,22.56263,88.36304,Hind INOX,22.564914,88.359152,Multiplex


### Step 4: Data Science Methodology

The data science methodology consists of many processes, including exploratory data analysis, data preparation, model building, model development, model deployment, etc. Data science methodology is an iterative process, and it is key to keep repeating the steps until we get the desired results.

#### Exploratory Data Analysis

We will use the groupby method to get the number of venues in each ward.

In [62]:
kolkata_venues.groupby('Ward').count()

Unnamed: 0_level_0,Ward Latitude,Ward Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Ward,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
1,6,6,6,6,6,6
10,5,5,5,5,5,5
100,6,6,6,6,6,6
101,16,16,16,16,16,16
102,4,4,4,4,4,4
103,1,1,1,1,1,1
104,6,6,6,6,6,6
105,2,2,2,2,2,2
106,5,5,5,5,5,5
107,5,5,5,5,5,5


One-hot encoding converts categorical variables into binary columns, which are better suited to being fed into our machine learning model.

In [63]:
# one hot encoding
kolkata_onehot = pd.get_dummies(kolkata_venues[['Venue Category']], prefix="", prefix_sep="")

# add Ward column back to dataframe
kolkata_onehot['Ward'] = kolkata_venues['Ward']

# move Ward column to the first column
fixed_columns = [kolkata_onehot.columns[-1]] + list(kolkata_onehot.columns[:-1])
kolkata_onehot = kolkata_onehot[fixed_columns]

kolkata_onehot.head()

Unnamed: 0,Ward,ATM,American Restaurant,Art Museum,Asian Restaurant,Athletics & Sports,Awadhi Restaurant,Bakery,Bank,Bengali Restaurant,Bistro,Bookstore,Boutique,Breakfast Spot,Bus Station,Café,Campground,Chinese Restaurant,Clothing Store,Convenience Store,Cosmetics Shop,Department Store,Dhaba,Electronics Store,Fast Food Restaurant,Flea Market,Furniture / Home Store,Grocery Store,Gym,Gym / Fitness Center,Harbor / Marina,Historic Site,History Museum,Hostel,Hotel,IT Services,Ice Cream Shop,Indian Restaurant,Indian Sweet Shop,Indie Movie Theater,Insurance Office,Italian Restaurant,Jewelry Store,Juice Bar,Kerala Restaurant,Lake,Lounge,Market,Men's Store,Metro Station,Mobile Phone Shop,Movie Theater,Mughlai Restaurant,Multiplex,Music Venue,Nightclub,Park,Performing Arts Venue,Pharmacy,Photography Studio,Pier,Pizza Place,Playground,Plaza,Pool,Residential Building (Apartment / Condo),Restaurant,River,Sausage Shop,Shipping Store,Shoe Store,Shopping Mall,Snack Place,Spa,Supermarket,Tea Room,Train Station,Tram Station,Vegetarian / Vegan Restaurant,Women's Store
0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


Now let's group the rows by ward and take the mean of the binary values. This tells us how frequent each venue category is in its ward. We name the new dataframe kolkata_grouped.

In [64]:
kolkata_grouped = kolkata_onehot.groupby('Ward').mean().reset_index()
kolkata_grouped

Unnamed: 0,Ward,ATM,American Restaurant,Art Museum,Asian Restaurant,Athletics & Sports,Awadhi Restaurant,Bakery,Bank,Bengali Restaurant,Bistro,Bookstore,Boutique,Breakfast Spot,Bus Station,Café,Campground,Chinese Restaurant,Clothing Store,Convenience Store,Cosmetics Shop,Department Store,Dhaba,Electronics Store,Fast Food Restaurant,Flea Market,Furniture / Home Store,Grocery Store,Gym,Gym / Fitness Center,Harbor / Marina,Historic Site,History Museum,Hostel,Hotel,IT Services,Ice Cream Shop,Indian Restaurant,Indian Sweet Shop,Indie Movie Theater,Insurance Office,Italian Restaurant,Jewelry Store,Juice Bar,Kerala Restaurant,Lake,Lounge,Market,Men's Store,Metro Station,Mobile Phone Shop,Movie Theater,Mughlai Restaurant,Multiplex,Music Venue,Nightclub,Park,Performing Arts Venue,Pharmacy,Photography Studio,Pier,Pizza Place,Playground,Plaza,Pool,Residential Building (Apartment / Condo),Restaurant,River,Sausage Shop,Shipping Store,Shoe Store,Shopping Mall,Snack Place,Spa,Supermarket,Tea Room,Train Station,Tram Station,Vegetarian / Vegan Restaurant,Women's Store
0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.333333,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,10,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,100,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.333333,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,101,0.0,0.0,0.0,0.0,0.0,0.0,0.125,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.375,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0625,0.0,0.0625,0.0625,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0625,0.0,0.0625,0.0,0.0,0.0,0.0625,0.0,0.0,0.0,0.0,0.0625,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0625,0.0,0.0,0.0,0.0
4,102,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,103,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
6,104,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.333333,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,105,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,106,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,107,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.2,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
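
To see why the mean of the one-hot columns gives a frequency, here is the same computation on a handful of rows: the six venue categories listed earlier for ward 1, plus a single venue for ward 103. The variable names are local to this sketch.

```python
import pandas as pd

venues = pd.DataFrame(
    {
        "Ward": ["1"] * 6 + ["103"],
        "Venue Category": [
            "Photography Studio",
            "Park",
            "Hotel",
            "Multiplex",
            "Park",
            "Market",
            "Train Station",
        ],
    }
)

# One column per category, 1.0 where the row belongs to that category.
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
onehot.insert(0, "Ward", venues["Ward"])

# The mean of a 0/1 column is the fraction of the ward's venues in that category.
freq = onehot.groupby("Ward").mean()
print(freq.round(2))
```

Ward 1 has two parks among six venues, so its Park frequency is 2/6, which matches the 0.333333 shown in the grouped output above.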


In [65]:
kolkata_grouped.shape

(135, 80)

In [66]:
kolkata_grouped['Ward'] = kolkata_grouped['Ward'].astype(str)
kolkata_grouped.dtypes

Ward                                         object
ATM                                         float64
American Restaurant                         float64
Art Museum                                  float64
Asian Restaurant                            float64
Athletics & Sports                          float64
Awadhi Restaurant                           float64
Bakery                                      float64
Bank                                        float64
Bengali Restaurant                          float64
Bistro                                      float64
Bookstore                                   float64
Boutique                                    float64
Breakfast Spot                              float64
Bus Station                                 float64
Café                                        float64
Campground                                  float64
Chinese Restaurant                          float64
Clothing Store                              float64
Convenience 

Before using the clustering algorithm, we explore the dataset a little further. The algorithm will cluster the wards according to how frequently each venue category occurs, so we now loop over the wards and print each one with its top 5 venue categories.

In [67]:
num_top_venues = 5

for wards in kolkata_grouped['Ward']:
    print("----"+wards+"----")
    temp = kolkata_grouped[kolkata_grouped['Ward'] == wards].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----1----
                venue  freq
0                Park  0.33
1  Photography Studio  0.17
2              Market  0.17
3               Hotel  0.17
4           Multiplex  0.17


----10----
                  venue  freq
0  Fast Food Restaurant   0.2
1     Indian Sweet Shop   0.2
2                Market   0.2
3                  Bank   0.2
4         Metro Station   0.2


----100----
                venue  freq
0                Park  0.33
1  Photography Studio  0.17
2              Market  0.17
3               Hotel  0.17
4           Multiplex  0.17


----101----
                   venue  freq
0                   Café  0.38
1                 Bakery  0.12
2            Pizza Place  0.06
3      Indian Restaurant  0.06
4  Performing Arts Venue  0.06


----102----
                venue  freq
0            Pharmacy  0.25
1              Market  0.25
2  Chinese Restaurant  0.25
3              Bistro  0.25
4                 ATM  0.00


----103----
                   venue  freq
0          Train Sta

Now we write a function to sort the venues in descending order.

In [68]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
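
As a point of comparison, pandas also offers `Series.nlargest`, which returns the top entries directly and, with the default `keep='first'`, resolves ties in order of appearance. A small sketch on ward 1's frequencies (rounded values taken from the output above; the variable names are illustrative):

```python
import pandas as pd

# Ward 1's non-zero frequencies, in the alphabetical column order of the dataframe.
ward_1 = pd.Series(
    {
        "Hotel": 0.17,
        "Market": 0.17,
        "Multiplex": 0.17,
        "Park": 0.33,
        "Photography Studio": 0.17,
    }
)

# Park comes first; the tied categories follow in order of appearance.
top3 = ward_1.nlargest(3)
print(list(top3.index))
```

Because many categories tie within a ward, a predictable tie-break makes the "nth most common venue" columns easier to compare between runs.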

Now let's use the function and show the top 10 venues in a pandas dataframe.

In [69]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Ward']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
Wards_venues_sorted = pd.DataFrame(columns=columns)
Wards_venues_sorted['Ward'] = kolkata_grouped['Ward']

for ind in np.arange(kolkata_grouped.shape[0]):
    Wards_venues_sorted.iloc[ind, 1:] = return_most_common_venues(kolkata_grouped.iloc[ind, :], num_top_venues)

Wards_venues_sorted.head()

Unnamed: 0,Ward,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,1,Park,Photography Studio,Hotel,Multiplex,Market,Grocery Store,Dhaba,Electronics Store,Fast Food Restaurant,Flea Market
1,10,Indian Sweet Shop,Fast Food Restaurant,Bank,Market,Metro Station,Gym / Fitness Center,Dhaba,Electronics Store,Flea Market,Furniture / Home Store
2,100,Park,Photography Studio,Hotel,Multiplex,Market,Grocery Store,Dhaba,Electronics Store,Fast Food Restaurant,Flea Market
3,101,Café,Bakery,Ice Cream Shop,Performing Arts Venue,Indian Restaurant,Pizza Place,Restaurant,Nightclub,Hotel,Tea Room
4,102,Bistro,Chinese Restaurant,Market,Pharmacy,Gym,Dhaba,Electronics Store,Fast Food Restaurant,Flea Market,Furniture / Home Store


#### Clustering Neighborhoods


Now we enter the stage where we use our machine learning algorithm, K-means clustering.

K-means will group the wards into 5 clusters based on how similar the wards are to each other.

In [70]:
# set number of clusters
kclusters = 5
 
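# the positional axis argument of drop() was removed in pandas 2.0, so name the column explicitly
kolkata_grouped_clustering = kolkata_grouped.drop(columns='Ward')
 
# run k-means clustering (n_init=10 matches the default of older scikit-learn releases)
kmeans = KMeans(n_clusters=kclusters, random_state=0, n_init=10).fit(kolkata_grouped_clustering)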
 
# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:100]

array([1, 2, 1, 4, 2, 0, 1, 2, 2, 2, 4, 4, 1, 2, 1, 1, 1, 4, 1, 1, 1, 1,
       1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 2, 1, 2, 4, 1, 1, 3, 1, 1, 1, 1, 2,
       1, 2, 1, 4, 4, 2, 2, 4, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 3, 1, 1,
       1, 2, 2, 2, 2, 1, 1, 4, 2, 2, 2, 2, 1, 1, 1, 1, 1, 2, 1, 1, 2, 1,
       1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1], dtype=int32)
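
Before interpreting the clusters it helps to check how many wards fall into each one, since a cluster with a single ward says little about a group. A minimal sketch using only the first 22 labels from the array above (in the notebook itself one would pass `kmeans.labels_` directly):

```python
from collections import Counter

# First row of the label array printed above.
first_labels = [1, 2, 1, 4, 2, 0, 1, 2, 2, 2, 4, 4, 1, 2, 1, 1, 1, 4, 1, 1, 1, 1]

# Count how many wards received each cluster label.
cluster_sizes = Counter(first_labels)

for label, size in sorted(cluster_sizes.items()):
    print(f"cluster {label}: {size} wards")
```

Even in this small slice the labels are uneven: cluster 1 dominates and cluster 0 appears once.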

Now we add the 'Cluster Labels' to 'Wards_venues_sorted' as a new column, merge it with df_kolkata, and name the new dataframe 'kolkata_merged'.

In [71]:
Wards_venues_sorted.dtypes

Ward                      object
1st Most Common Venue     object
2nd Most Common Venue     object
3rd Most Common Venue     object
4th Most Common Venue     object
5th Most Common Venue     object
6th Most Common Venue     object
7th Most Common Venue     object
8th Most Common Venue     object
9th Most Common Venue     object
10th Most Common Venue    object
dtype: object

Which ward belongs to which group will be determined by the cluster labels.

In [72]:
# add clustering labels
Wards_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)




In [128]:
kolkata_merged = df_kolkata
kolkata_merged.head(10)

Unnamed: 0,Ward No.,Councillors,Office Address,Latitude,Longitude
0,1,SMT. SITA JAISWARA,"9A, Gopi Mondal Lane, Kolkata-700002",22.56263,88.36304
1,2,SMT. PUSPALI SINHA,"10, Rani Debendra Bala Rd Rishi, Paikpara Kolk...",22.56263,88.36304
2,3,DR SANTANU SEN,"Seth Lane, Kolkata-50",22.586374,88.353731
3,4,SHRI GOUTAM HALDAR,"9A, Raja Manindra Road, Kolkata-37",22.612521,88.383454
4,5,SHRI TARUN SAHA,"67B, Khelat Babu Lane, Kolkata-37",22.56263,88.36304
5,6,SMT. SUMAN SINGH,"3/1B ,Turner Road, Kolkata-02",22.56263,88.36304
6,7,SHRI BAPI GHOSH,"47A, Bagbazar Street, Kolkata-700003",22.603919,88.366481
7,8,SHRI PARTHA MITRA,"3/1E Raja Raj Ballav Street, Kolkata-03",22.56263,88.36304
8,9,SMT. MITALI SAHA,"13A Madan Mohantala Street, Kolkata-05",22.56263,88.36304
9,10,SMT. KARUNA SENGUPTA,"Raja Nabakrishna Street, Kolkata-05",22.596719,88.366462


In [129]:
df_kolkata['Ward No.'] = df_kolkata['Ward No.'].astype(str)

In [130]:
df_kolkata.dtypes

Ward No.           object
Councillors        object
Office Address     object
Latitude          float64
Longitude         float64
dtype: object

In [131]:
kolkata_merged = df_kolkata

# join df_kolkata with Wards_venues_sorted to add cluster labels and common venues to each ward's latitude/longitude
kolkata_merged = kolkata_merged.join(Wards_venues_sorted.set_index('Ward'), on='Ward No.')


Now we prepare our final dataframe kolkata_merged, which contains our previous dataset df_kolkata (ward numbers, councillors, office address, latitude, longitude), the cluster labels and the most common venues.

In [132]:
kolkata_merged = kolkata_merged.fillna(0)

In [133]:
kolkata_merged['Cluster Labels'] = kolkata_merged['Cluster Labels'].astype(int)

In [134]:
kolkata_merged.dtypes

Ward No.                   object
Councillors                object
Office Address             object
Latitude                  float64
Longitude                 float64
Cluster Labels              int64
1st Most Common Venue      object
2nd Most Common Venue      object
3rd Most Common Venue      object
4th Most Common Venue      object
5th Most Common Venue      object
6th Most Common Venue      object
7th Most Common Venue      object
8th Most Common Venue      object
9th Most Common Venue      object
10th Most Common Venue     object
dtype: object

In [135]:
kolkata_merged.head()

Unnamed: 0,Ward No.,Councillors,Office Address,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,1,SMT. SITA JAISWARA,"9A, Gopi Mondal Lane, Kolkata-700002",22.56263,88.36304,1,Park,Photography Studio,Hotel,Multiplex,Market,Grocery Store,Dhaba,Electronics Store,Fast Food Restaurant,Flea Market
1,2,SMT. PUSPALI SINHA,"10, Rani Debendra Bala Rd Rishi, Paikpara Kolk...",22.56263,88.36304,1,Park,Photography Studio,Hotel,Multiplex,Market,Grocery Store,Dhaba,Electronics Store,Fast Food Restaurant,Flea Market
2,3,DR SANTANU SEN,"Seth Lane, Kolkata-50",22.586374,88.353731,3,ATM,Snack Place,Market,Gym / Fitness Center,Dhaba,Electronics Store,Fast Food Restaurant,Flea Market,Furniture / Home Store,Grocery Store
3,4,SHRI GOUTAM HALDAR,"9A, Raja Manindra Road, Kolkata-37",22.612521,88.383454,2,ATM,Pharmacy,Park,Music Venue,Electronics Store,Dhaba,Fast Food Restaurant,Flea Market,Furniture / Home Store,Grocery Store
4,5,SHRI TARUN SAHA,"67B, Khelat Babu Lane, Kolkata-37",22.56263,88.36304,1,Park,Photography Studio,Hotel,Multiplex,Market,Grocery Store,Dhaba,Electronics Store,Fast Food Restaurant,Flea Market
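Because `join` keeps every row of the left dataframe by default, wards with no matching venue row get `NaN` in the new columns, and filling those with 0 gives them the same value as a genuine k-means label 0. The sketch below reproduces this on toy frames and shows one possible alternative, a sentinel label of -1, that keeps such wards distinguishable. The frames `wards` and `venues`, the ward IDs and the sentinel value are assumptions for illustration, not part of the notebook.

```python
import pandas as pd

# Toy stand-ins for df_kolkata and Wards_venues_sorted (made-up ward IDs).
wards = pd.DataFrame({'Ward No.': ['A', 'B', 'C'],
                      'Latitude': [22.56, 22.58, 22.61]})
venues = pd.DataFrame({'Cluster Labels': [1, 0],
                       'Ward': ['A', 'C'],
                       '1st Most Common Venue': ['Park', 'ATM']})

# Left join: ward 'B' has no venue row, so its new columns are NaN.
merged = wards.join(venues.set_index('Ward'), on='Ward No.')

# Filling with 0 (as in the notebook): 'B' becomes indistinguishable from real cluster 0.
zero_labels = merged['Cluster Labels'].fillna(0).astype(int)

# Alternative (an assumption, not the notebook's method): mark missing wards with -1.
flag_labels = merged['Cluster Labels'].fillna(-1).astype(int)

print(zero_labels.tolist())
print(flag_labels.tolist())
```

With the sentinel, rows labelled -1 can be filtered out before the clusters are examined, so a real cluster 0 is not mixed with wards that have no data.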


Now we will use folium to visualise the clusters superimposed on a map of the city. Each cluster is drawn in a different colour so that the groups are easy to tell apart.

In [101]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(kolkata_merged['Latitude'], kolkata_merged['Longitude'], kolkata_merged['Ward No.'], kolkata_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
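The colour lookup in the loop above uses `rainbow[cluster-1]`, so label 0 wraps around to the last colour in the list. A minimal sketch of that mapping, next to direct indexing, is shown below. The names `palette`, `offset_map` and `direct_map` are hypothetical, and the palette is built with `matplotlib.colormaps` (available in recent matplotlib versions) rather than the notebook's exact code.

```python
import numpy as np
from matplotlib import colormaps, colors

kclusters = 5

# One hex colour per cluster, sampled evenly from the rainbow colormap.
palette = [colors.rgb2hex(c) for c in colormaps['rainbow'](np.linspace(0, 1, kclusters))]

# Mapping with the `cluster - 1` offset: label 0 picks palette[-1], the last colour.
offset_map = {label: palette[label - 1] for label in range(kclusters)}

# Direct indexing: label 0 picks the first colour.
direct_map = {label: palette[label] for label in range(kclusters)}

for label in range(kclusters):
    print(label, offset_map[label], direct_map[label])
```

Either mapping gives each label a distinct colour. The difference only matters when a legend is written by hand and has to match the colours on the map.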

## Step 5: Generating insights

### Examine Clusters


#### Cluster 1


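This cluster mainly contains the wards for which the Foursquare API returned no information. Their missing values, including the cluster label, were filled with 0 in the merge step, which is why they appear here. So we can ignore this cluster for now.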

In [82]:
kolkata_merged.loc[kolkata_merged['Cluster Labels'] == 0, kolkata_merged.columns[[1] + list(range(5, kolkata_merged.shape[1]))]]

Unnamed: 0,Councillors,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
26,SMT. MINAKSHI GUPTA,0,0,0,0,0,0,0,0,0,0,0
58,SMT. JALY BOSE,0,0,0,0,0,0,0,0,0,0,0
59,KAISER JAMIL,0,0,0,0,0,0,0,0,0,0,0
70,SMT. PAPIA SINGH,0,0,0,0,0,0,0,0,0,0,0
77,NEZAMUDDIN SHAMS,0,0,0,0,0,0,0,0,0,0,0
88,SMT. MAMATA MAJUMDAR,0,Train Station,Movie Theater,Women's Store,Gym,Dhaba,Electronics Store,Fast Food Restaurant,Flea Market,Furniture / Home Store,Grocery Store
95,SHRI DEBABRATA MAJUMDER,0,0,0,0,0,0,0,0,0,0,0
102,SMT. NANDITA ROY,0,Train Station,Women's Store,Gym / Fitness Center,Dhaba,Electronics Store,Fast Food Restaurant,Flea Market,Furniture / Home Store,Grocery Store,Gym
134,AKHTARI NIJAMI,0,0,0,0,0,0,0,0,0,0,0


#### Cluster 2


In [83]:
kolkata_merged.loc[kolkata_merged['Cluster Labels'] == 1, kolkata_merged.columns[[1] + list(range(5, kolkata_merged.shape[1]))]]

Unnamed: 0,Councillors,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,SMT. SITA JAISWARA,1,Park,Photography Studio,Hotel,Multiplex,Market,Grocery Store,Dhaba,Electronics Store,Fast Food Restaurant,Flea Market
1,SMT. PUSPALI SINHA,1,Park,Photography Studio,Hotel,Multiplex,Market,Grocery Store,Dhaba,Electronics Store,Fast Food Restaurant,Flea Market
4,SHRI TARUN SAHA,1,Park,Photography Studio,Hotel,Multiplex,Market,Grocery Store,Dhaba,Electronics Store,Fast Food Restaurant,Flea Market
5,SMT. SUMAN SINGH,1,Park,Photography Studio,Hotel,Multiplex,Market,Grocery Store,Dhaba,Electronics Store,Fast Food Restaurant,Flea Market
7,SHRI PARTHA MITRA,1,Park,Photography Studio,Hotel,Multiplex,Market,Grocery Store,Dhaba,Electronics Store,Fast Food Restaurant,Flea Market
8,SMT. MITALI SAHA,1,Park,Photography Studio,Hotel,Multiplex,Market,Grocery Store,Dhaba,Electronics Store,Fast Food Restaurant,Flea Market
10,SHRI ATIN GHOSH,1,Park,Photography Studio,Hotel,Multiplex,Market,Grocery Store,Dhaba,Electronics Store,Fast Food Restaurant,Flea Market
11,SMT. PRANATI BHATTACHARJEE,1,Park,Photography Studio,Hotel,Multiplex,Market,Grocery Store,Dhaba,Electronics Store,Fast Food Restaurant,Flea Market
19,SHRI VIJAY UPADHYAY,1,Park,Metro Station,Art Museum,History Museum,Women's Store,Electronics Store,Fast Food Restaurant,Flea Market,Furniture / Home Store,Grocery Store
20,SMT. SUJATA SAHA,1,Park,Photography Studio,Hotel,Multiplex,Market,Grocery Store,Dhaba,Electronics Store,Fast Food Restaurant,Flea Market


We can see that most of the wards fall under the 2nd cluster. This cluster is important because its most common venue is the park. When formulating the business plan, we decided to avoid areas with the most parks, so we can set these wards aside until the model and the data are more refined.
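Statements about cluster sizes and dominant venues can be checked with a short summary instead of by reading the tables. A minimal sketch, assuming a frame shaped like `kolkata_merged`: the toy rows copy a few label and venue pairs from the outputs in this section, and the names `sample`, `cluster_sizes` and `top_venue` are hypothetical.

```python
import pandas as pd

# A few (cluster label, 1st most common venue) pairs taken from the tables in this section.
sample = pd.DataFrame({
    'Cluster Labels': [1, 1, 3, 2, 1, 4, 4],
    '1st Most Common Venue': ['Park', 'Park', 'ATM', 'ATM', 'Park', 'Café', 'Café'],
})

# Number of wards per cluster label.
cluster_sizes = sample['Cluster Labels'].value_counts().sort_index()

# Most frequent top-ranked venue within each cluster.
top_venue = (sample.groupby('Cluster Labels')['1st Most Common Venue']
                   .agg(lambda s: s.value_counts().idxmax()))

print(cluster_sizes)
print(top_venue)
```

Run on the full `kolkata_merged` frame, the same two lines would show how many wards each cluster holds and which venue most often ranks first in it.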

#### Cluster 3

In [84]:
kolkata_merged.loc[kolkata_merged['Cluster Labels'] == 2, kolkata_merged.columns[[1] + list(range(5, kolkata_merged.shape[1]))]]

Unnamed: 0,Councillors,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
3,SHRI GOUTAM HALDAR,2,ATM,Pharmacy,Park,Music Venue,Electronics Store,Dhaba,Fast Food Restaurant,Flea Market,Furniture / Home Store,Grocery Store
6,SHRI BAPI GHOSH,2,Clothing Store,River,Pier,Chinese Restaurant,Bank,Park,Dhaba,Electronics Store,Fast Food Restaurant,Flea Market
9,SMT. KARUNA SENGUPTA,2,Indian Sweet Shop,Fast Food Restaurant,Bank,Market,Metro Station,Gym / Fitness Center,Dhaba,Electronics Store,Flea Market,Furniture / Home Store
13,SHRI AMAL CHAKRABORTY,2,Gym,Clothing Store,Historic Site,Women's Store,Dhaba,Electronics Store,Fast Food Restaurant,Flea Market,Furniture / Home Store,Grocery Store
14,SMT. SHUKLA BHORE,2,Clothing Store,Chinese Restaurant,Café,Bakery,Pool,Historic Site,Women's Store,Gym,Fast Food Restaurant,Flea Market
15,SHRI SADHAN SAHA,2,Vegetarian / Vegan Restaurant,Indian Restaurant,Plaza,Bakery,Gym,Dhaba,Electronics Store,Fast Food Restaurant,Flea Market,Furniture / Home Store
18,SMT. SIKHA SAHA,2,IT Services,Women's Store,Gym / Fitness Center,Dhaba,Electronics Store,Fast Food Restaurant,Flea Market,Furniture / Home Store,Grocery Store,Gym
27,IQBAL AHMED,2,Women's Store,Chinese Restaurant,Department Store,Ice Cream Shop,Grocery Store,Dhaba,Indian Restaurant,Electronics Store,Fast Food Restaurant,Flea Market
32,SHRI PABITRA BISWAS,2,Chinese Restaurant,Restaurant,Gym / Fitness Center,Pharmacy,Women's Store,Furniture / Home Store,Department Store,Dhaba,Electronics Store,Fast Food Restaurant
33,SMT. ALOANANDA DAS,2,ATM,Pharmacy,Restaurant,Department Store,Dhaba,Electronics Store,Fast Food Restaurant,Flea Market,Furniture / Home Store,Grocery Store


The 3rd cluster is populated with clothing stores and restaurants. 

#### Cluster 4

In [85]:
kolkata_merged.loc[kolkata_merged['Cluster Labels'] == 3, kolkata_merged.columns[[1] + list(range(5, kolkata_merged.shape[1]))]]

Unnamed: 0,Councillors,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
2,DR SANTANU SEN,3,ATM,Snack Place,Market,Gym / Fitness Center,Dhaba,Electronics Store,Fast Food Restaurant,Flea Market,Furniture / Home Store,Grocery Store
97,SHRI MRITYUNJOY CHAKRABORTY,3,ATM,Metro Station,Indian Restaurant,Ice Cream Shop,Dhaba,Electronics Store,Fast Food Restaurant,Flea Market,Furniture / Home Store,Grocery Store
133,SHAMS IQBAL,3,ATM,Gym / Fitness Center,Dhaba,Electronics Store,Fast Food Restaurant,Flea Market,Furniture / Home Store,Grocery Store,Gym,Harbor / Marina


This cluster is helpful because its common venues include fitness centers, which will attract fitness enthusiasts, and metro stations, which will ensure ease of transportation.

#### Cluster 5

In [86]:
kolkata_merged.loc[kolkata_merged['Cluster Labels'] == 4, kolkata_merged.columns[[1] + list(range(5, kolkata_merged.shape[1]))]]

Unnamed: 0,Councillors,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
17,SHRI SUNANDA SARKAR,4,Café,Bakery,Ice Cream Shop,Performing Arts Venue,Indian Restaurant,Pizza Place,Restaurant,Nightclub,Hotel,Tea Room
38,MD. JASIMUDDIN,4,Café,Bakery,Ice Cream Shop,Performing Arts Venue,Indian Restaurant,Pizza Place,Restaurant,Nightclub,Hotel,Tea Room
68,SHRI SUKDEV CHAKRABARTY,4,Café,Ice Cream Shop,Chinese Restaurant,Women's Store,Gym,Electronics Store,Fast Food Restaurant,Flea Market,Furniture / Home Store,Grocery Store
85,SMT. TISHTA BISWAS (DAS),4,Café,Indian Restaurant,Ice Cream Shop,Hotel,Lounge,Clothing Store,Bengali Restaurant,Chinese Restaurant,Women's Store,Grocery Store
100,SHRI BAPPADITYA DASGUPTA,4,Café,Bakery,Ice Cream Shop,Performing Arts Venue,Indian Restaurant,Pizza Place,Restaurant,Nightclub,Hotel,Tea Room
107,SHRI SHYAMAL BANERJEE,4,Café,Women's Store,Cosmetics Shop,Dhaba,Electronics Store,Fast Food Restaurant,Flea Market,Furniture / Home Store,Grocery Store,Gym
108,SMT. ANANYA BANERJEE,4,Café,Clothing Store,Movie Theater,Women's Store,Gym,Electronics Store,Fast Food Restaurant,Flea Market,Furniture / Home Store,Grocery Store
113,SHRI BISWAJIT MANDAL,4,Café,Lounge,Movie Theater,Women's Store,Gym,Electronics Store,Fast Food Restaurant,Flea Market,Furniture / Home Store,Grocery Store
130,SHRI SOVAN CHATTERJEE,4,Convenience Store,Ice Cream Shop,Chinese Restaurant,Café,Gym,Electronics Store,Fast Food Restaurant,Flea Market,Furniture / Home Store,Grocery Store
142,SHRI INDRAJIT BHATTACHARYA,4,Café,Pizza Place,Bakery,Women's Store,Gym / Fitness Center,Electronics Store,Fast Food Restaurant,Flea Market,Furniture / Home Store,Grocery Store


The most common venues in this cluster are cafés and other restaurants. Strategically, these areas attract footfall, which will positively impact our business. As adults and young adults often visit these areas, these locations might be beneficial for our business.

## Conclusion

We are now at the end of this analysis. I have used all the knowledge I could gather from the IBM Professional Data Science Certificate courses. Feel free to comment on where you think I can improve this project.

So far, by my understanding, clusters 5 and 4 are the most important for artificial turfs. These locations will positively impact our business for the reasons I have provided earlier. For cluster 3 I am somewhat skeptical at the moment. I would very much appreciate feedback on how I can use this cluster to my advantage.