# IBM Applied Data Science Capstone Course by Coursera
### Week 5 Final Report
**_Opening a New Shopping Mall in Mumbai, India_**
- Build a dataframe of neighborhoods in Mumbai, India by web scraping the data from a Wikipedia page
- Get the geographical coordinates of the neighborhoods
- Obtain the venue data for the neighborhoods from Foursquare API
- Explore and cluster the neighborhoods
- Select the best cluster to open a new shopping mall
***
### 1. Import libraries

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)

import json # library to handle JSON files

# !pip install geocoder
import geocoder # to get coordinates

# !pip install Nominatim
# !pip install geopy
# !pip install geopy[aiohttp]

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

# !pip install BeautifulSoup4
import requests # library to handle requests
from bs4 import BeautifulSoup # library to parse HTML and XML documents

from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

import folium # map rendering library

print("Libraries imported.")


### 2. Scrape data from Wikipedia page into a DataFrame

In [3]:
# send the GET request
data = requests.get("https://en.wikipedia.org/wiki/List_of_neighbourhoods_in_Mumbai").text

In [4]:
# parse data from the html into a beautifulsoup object
soup = BeautifulSoup(data, 'html.parser')

In [5]:
# print(soup)

In [6]:
# create a list to store neighborhood data
ne = []
neighborhoodList = []

In [14]:
# collect the text of every entry in the page's table of contents
for row in soup.find_all("div", class_="toc")[0].find_all("li"):
    ne.append(row.text)

# keep entries 1-42, skipping those that mention 'Eastern', 'Harbour' or 'South'
for i in range(1, 43):
    if all(word not in ne[i] for word in ("Eastern", "Harbour", "South")):
        # drop everything up to the first space (the section number) and any newlines
        neighborhoodList.append(ne[i][ne[i].find(" ")+1:].replace("\n", ""))

print(neighborhoodList)

['Andheri', 'Mira-Bhayandar', 'Bandra', 'Borivali', 'Dahisar', 'Goregaon', 'Jogeshwari', 'Juhu', 'Kandivali west', 'Kandivali east', 'Khar', 'Malad', 'Santacruz', 'Vasai', 'Virar', 'Vile Parle', 'Bhandup', 'Ghatkopar', 'Kanjurmarg', 'Kurla', 'Mulund', 'Powai', 'Vidyavihar', 'Vikhroli', 'Chembur', 'Govandi', 'Mankhurd', 'Trombay', 'Antop Hill', 'Byculla', 'Colaba', 'Dadar', 'Fort', 'Girgaon', 'Kalbadevi', 'Kamathipura', 'Matunga', 'Parel', 'Tardeo']
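
The cleaning step above can be sketched as a small standalone function. This is only an illustration: `clean_toc_entries` and the sample strings are hypothetical, and the assumption that section headings look like `"2 Eastern Suburbs"` is not confirmed by the notebook. Dropping repeated names also keeps the list stable if the cell is run more than once.

```python
def clean_toc_entries(entries, skip_words=("Eastern", "Harbour", "South")):
    """Strip leading section numbers from TOC entries and drop duplicates."""
    cleaned = []
    for entry in entries:
        # skip entries that mention one of the heading words
        if any(word in entry for word in skip_words):
            continue
        # "1.1 Andheri\n" -> "Andheri"
        name = entry.partition(" ")[2].replace("\n", "").strip()
        # keep only the first occurrence of each name
        if name and name not in cleaned:
            cleaned.append(name)
    return cleaned


# hypothetical TOC strings, including one repeated entry
sample = ["1.1 Andheri\n", "1.2 Bandra\n", "2 Eastern Suburbs\n",
          "2.1 Bhandup\n", "1.1 Andheri\n"]
print(clean_toc_entries(sample))
```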


In [8]:
# create a new DataFrame from the list
mi_df = pd.DataFrame({"Neighborhood": neighborhoodList})

mi_df.head()

Unnamed: 0,Neighborhood
0,Andheri
1,Mira-Bhayandar
2,Bandra
3,Borivali
4,Dahisar


In [9]:
# print the number of rows of the dataframe
mi_df.shape

(39, 1)

In [10]:
# @hidden_cell
email = 'parekh.bk@gmail.com'

In [29]:

from geopy.adapters import AioHTTPAdapter
from geopy.extra.rate_limiter import AsyncRateLimiter
from geopy.geocoders import Nominatim

# top-level `async with` / `await` relies on the notebook's running event loop
async with Nominatim(
    user_agent=email,
    adapter_factory=AioHTTPAdapter,
) as geolocator:

    # wait at least one second between geocoding requests
    geocode = AsyncRateLimiter(geolocator.geocode, min_delay_seconds=1)
    locations = [await geocode(s) for s in neighborhoodList]

In [31]:
print(locations)

[Location(Kurla, S G Barve Marg, Naware Baug Jagruti Nagar, L Ward, Zone 5, Mumbai, Mumbai Suburban, Maharashtra, 400024, India, (19.0652797, 72.8793805, 0.0))]


### 3. Get the geographical coordinates

In [32]:
# define a function to get coordinates
def get_latlng(neighborhood):
    # initialize your variable to None
    lat_lng_coords = None
    # loop until you get the coordinates
    while(lat_lng_coords is None):
        g = geocoder.arcgis('{}, Mumbai, India'.format(neighborhood))
        lat_lng_coords = g.latlng
    return lat_lng_coords
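
The `while` loop above never exits if the geocoder keeps returning `None`. A bounded retry is one possible alternative; the sketch below is an assumption rather than the notebook's method. `geocode_with_retry` and `flaky_lookup` are hypothetical names, and `lookup` stands in for something like `lambda q: geocoder.arcgis(q).latlng`.

```python
import time


def geocode_with_retry(lookup, query, max_tries=5, delay_seconds=0.0):
    """Call lookup(query) until it returns coordinates or max_tries is reached."""
    for _ in range(max_tries):
        coords = lookup(query)
        if coords is not None:
            return coords
        # pause before the next attempt
        time.sleep(delay_seconds)
    raise RuntimeError("no coordinates for {!r} after {} tries".format(query, max_tries))


# hypothetical lookup that fails twice before succeeding
calls = {"n": 0}

def flaky_lookup(query):
    calls["n"] += 1
    return None if calls["n"] < 3 else [19.0, 72.8]


print(geocode_with_retry(flaky_lookup, "Andheri, Mumbai, India"))
```

Raising an error after the last attempt makes a failed lookup visible instead of hanging the notebook.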

In [33]:
# call the function to get the coordinates, store in a new list using list comprehension
coords = [ get_latlng(neighborhood) for neighborhood in mi_df["Neighborhood"].tolist() ]

In [34]:
coords

[[19.11846999503601, 72.8417699936972],
 [18.95051999900049, 72.82712001950813],
 [19.054370000000063, 72.84017000000006],
 [19.229360000000042, 72.85751000000005],
 [19.250030000000038, 72.85907000000003],
 [19.164550000000077, 72.84946000000008],
 [19.137920000000065, 72.84941000000003],
 [19.014920000000075, 72.84522000000004],
 [19.20694000000003, 72.83494000000007],
 [19.205760000000055, 72.86953000000005],
 [19.06913000000003, 72.84640000000007],
 [19.186550000000068, 72.84842000000003],
 [19.081770000000063, 72.84205000000003],
 [19.07934000000006, 72.83916000000005],
 [19.01657000000006, 72.85853000000003],
 [19.096240000000023, 72.85024000000004],
 [19.145560000000046, 72.94856000000004],
 [19.086299999191567, 72.90908002895725],
 [19.131380000000036, 72.93568000000005],
 [19.064980000000048, 72.88069000000007],
 [19.171830000000057, 72.95565000000005],
 [19.123100000000022, 72.90942000000007],
 [19.02324998240138, 72.84390000293212],
 [19.111090000000047, 72.92781000000008],


In [35]:
# create temporary dataframe to populate the coordinates into Latitude and Longitude
df_coords = pd.DataFrame(coords, columns=['Latitude', 'Longitude'])

In [36]:
# merge the coordinates into the original dataframe
mi_df['Latitude'] = df_coords['Latitude']
mi_df['Longitude'] = df_coords['Longitude']

In [37]:
# check the neighborhoods and the coordinates
print(mi_df.shape)
mi_df

(39, 3)


Unnamed: 0,Neighborhood,Latitude,Longitude
0,Andheri,19.11847,72.84177
1,Mira-Bhayandar,18.95052,72.82712
2,Bandra,19.05437,72.84017
3,Borivali,19.22936,72.85751
4,Dahisar,19.25003,72.85907
5,Goregaon,19.16455,72.84946
6,Jogeshwari,19.13792,72.84941
7,Juhu,19.01492,72.84522
8,Kandivali west,19.20694,72.83494
9,Kandivali east,19.20576,72.86953


In [38]:
# save the DataFrame as CSV file
mi_df.to_csv("mi_df.csv", index=False)

### 4. Create a map of Mumbai with neighborhoods superimposed on top

In [39]:
# @hidden_cell
email = 'parekh.bk@gmail.com'

In [40]:
# get the coordinates of Mumbai India
address = 'Mumbai, India'

geolocator = Nominatim(user_agent=email)
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Mumbai, India are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Mumbai, India are 19.0759899, 72.8773928.


In [41]:
# create map of Mumbai using latitude and longitude values
map_mi = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, neighborhood in zip(mi_df['Latitude'], mi_df['Longitude'], mi_df['Neighborhood']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_mi)  

map_mi

In [42]:
# save the map as HTML file
map_mi.save('map_mi.html')

### 5. Use the Foursquare API to explore the neighborhoods

In [43]:
# @hidden_cell

# define Foursquare Credentials and Version
CLIENT_ID = 'I1QIUYQEL1T2J3AU5W14B4QA2CULGRPOB45RBZ43P0IBV0ZC' # your Foursquare ID
CLIENT_SECRET = '3M203KEHZC2XLL52DIFG23ZA4XT5S0TU3UDFSAPYNL4OXIFE' # your Foursquare Secret

In [44]:
# define Foursquare API version and query parameters
VERSION = '20180605' # Foursquare API version
radius = 10000 # not passed to getNearbyVenues below, which uses its own default of 500
LIMIT = 200

print('Your credentials:')
# print('CLIENT_ID: ' + CLIENT_ID)
# print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:


**Now, let's get up to 200 venues (`LIMIT`) within 500 meters of each neighborhood, the default `radius` of `getNearbyVenues`.**

In [45]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID,
            CLIENT_SECRET,
            VERSION,
            lat,
            lng,
            radius,
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
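
As an alternative to formatting the query string by hand, the request URL can be assembled with the standard library's `urlencode`. This is a minimal sketch: `build_explore_url` is a hypothetical helper, and `"YOUR_ID"` and `"YOUR_SECRET"` are placeholders rather than real credentials. The endpoint and parameter names are those used in the function above.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Return the explore URL with properly escaped query parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    return EXPLORE_ENDPOINT + "?" + urlencode(params)


# placeholder credentials; coordinates are those of Andheri from mi_df
url = build_explore_url("YOUR_ID", "YOUR_SECRET", "20180605", 19.11847, 72.84177, 500, 200)
query = parse_qs(urlsplit(url).query)
print(query["ll"], query["radius"], query["limit"])
```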

In [46]:
venues_df = getNearbyVenues(names=mi_df['Neighborhood'], latitudes=mi_df['Latitude'], 
    longitudes=mi_df['Longitude'])
venues_df.head()
venues_df.to_csv('venus.csv')

**Let's check how many venues were returned for each neighborhood**

In [47]:
venues_df.groupby(["Neighborhood"]).count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Andheri,19,19,19,19,19,19
Antop Hill,5,5,5,5,5,5
Bandra,16,16,16,16,16,16
Bhandup,3,3,3,3,3,3
Borivali,13,13,13,13,13,13
Byculla,5,5,5,5,5,5
Chembur,24,24,24,24,24,24
Colaba,16,16,16,16,16,16
Dadar,25,25,25,25,25,25
Dahisar,9,9,9,9,9,9


**Let's find out how many unique categories can be curated from all the returned venues**

In [48]:
print('There are {} unique categories.'.format(len(venues_df['Venue Category'].unique())))

There are 132 unique categories.


In [49]:
# print out the list of categories
venues_df['Venue Category'].unique()[:200]

array(['Bakery', 'Indian Restaurant', 'Fast Food Restaurant',
       'Department Store', 'Vegetarian / Vegan Restaurant', 'Restaurant',
       'Athletics & Sports', 'Food Court', 'Salon / Barbershop',
       'Bookstore', 'Accessories Store', 'Golf Course',
       'Electronics Store', 'Burger Joint', 'Food', 'Bus Station', 'Café',
       'Convenience Store', 'Train Station', 'Design Studio',
       'Paper / Office Supplies Store', 'Brewery', 'Breakfast Spot',
       'Platform', 'Pier', 'Furniture / Home Store', 'Food Truck',
       'Ice Cream Shop', 'Chinese Restaurant', 'Clothing Store',
       'Snack Place', 'Pizza Place', 'Dog Run', 'Bar', 'Pet Store',
       'Seafood Restaurant', 'Mughlai Restaurant', 'Asian Restaurant',
       'Arts & Crafts Store', 'Food & Drink Shop', 'Coffee Shop',
       'Lounge', 'Hotel', 'Juice Bar', 'Farmers Market', 'Spa',
       'Gym / Fitness Center', 'Bike Rental / Bike Share', 'Neighborhood',
       'Dessert Shop', 'Gastropub', 'Movie Theater', 'Shoe St

In [50]:
# list the venues whose category contains 'Restaurant'
cat1 = venues_df[venues_df['Venue Category'].str.contains('Restaurant')]
cat1

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
1,Andheri,19.11847,72.84177,Cafe Alfa,19.119667,72.84356,Indian Restaurant
2,Andheri,19.11847,72.84177,Radha Krishna Veg Restaurant,19.11513,72.84306,Indian Restaurant
3,Andheri,19.11847,72.84177,McDonald's,19.119691,72.846102,Fast Food Restaurant
5,Andheri,19.11847,72.84177,Radha Krishna (RK),19.114568,72.842758,Vegetarian / Vegan Restaurant
6,Andheri,19.11847,72.84177,Amar Restaurant,19.118193,72.84521,Restaurant
9,Andheri,19.11847,72.84177,Alpha Restaurant,19.118847,72.84491,Indian Restaurant
14,Andheri,19.11847,72.84177,Moti Mahal,19.120506,72.845677,Indian Restaurant
15,Andheri,19.11847,72.84177,Smoothies,19.119391,72.846145,Indian Restaurant
19,Mira-Bhayandar,18.95052,72.82712,Shree Thaker Bhojnalay,18.951217,72.828326,Indian Restaurant
20,Mira-Bhayandar,18.95052,72.82712,Bhagat Tarachand Restaurant,18.951802,72.830486,Indian Restaurant


In [51]:
# check if the results contain "Shopping Mall"
"Shopping Mall" in venues_df['Venue Category'].unique()

True

### 6. Analyze Each Neighborhood

In [52]:
# one hot encoding
mi_onehot = pd.get_dummies(venues_df[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
mi_onehot['Neighborhoods'] = venues_df['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [mi_onehot.columns[-1]] + list(mi_onehot.columns[:-1])
mi_onehot = mi_onehot[fixed_columns]

print(mi_onehot.shape)
mi_onehot.head()

(584, 133)


Unnamed: 0,Neighborhoods,ATM,Accessories Store,American Restaurant,Antique Shop,Arcade,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,BBQ Joint,Bakery,Bank,Bar,Beach,Bed & Breakfast,Bengali Restaurant,Bike Rental / Bike Share,Bistro,Boat or Ferry,Bookstore,Boutique,Breakfast Spot,Brewery,Burger Joint,Bus Station,Café,Cheese Shop,Chinese Restaurant,Clothing Store,Cocktail Bar,Coffee Shop,Comfort Food Restaurant,Concert Hall,Convenience Store,Cupcake Shop,Dance Studio,Deli / Bodega,Department Store,Design Studio,Dessert Shop,Diner,Dog Run,Donut Shop,Electronics Store,Farmers Market,Fast Food Restaurant,Field,Flea Market,Flower Shop,Food,Food & Drink Shop,Food Court,Food Truck,Fried Chicken Joint,Furniture / Home Store,Gaming Cafe,Garden,Gastropub,German Restaurant,Gift Shop,Golf Course,Grocery Store,Gym,Gym / Fitness Center,Historic Site,History Museum,Hookah Bar,Hostel,Hot Dog Joint,Hotel,Hotel Pool,Ice Cream Shop,Indian Restaurant,Indie Movie Theater,Intersection,Italian Restaurant,Japanese Restaurant,Jewelry Store,Juice Bar,Lake,Lounge,Luggage Store,Maharashtrian Restaurant,Market,Mediterranean Restaurant,Men's Store,Middle Eastern Restaurant,Monument / Landmark,Movie Theater,Moving Target,Mughlai Restaurant,Multicuisine Indian Restaurant,Music Store,Neighborhood,Nightclub,Noodle House,North Indian Restaurant,Optical Shop,Paper / Office Supplies Store,Park,Parsi Restaurant,Pet Store,Pharmacy,Pier,Pizza Place,Platform,Playground,Plaza,Pub,Punjabi Restaurant,Recording Studio,Restaurant,Roof Deck,Salon / Barbershop,Sandwich Place,Seafood Restaurant,Shoe Store,Shopping Mall,Smoke Shop,Snack Place,Soccer Field,South Indian Restaurant,Spa,Sports Bar,Steakhouse,Sushi Restaurant,Tea Room,Theme Park,Trail,Train Station,Vegetarian / Vegan Restaurant,Women's Store,Yoga Studio
0,Andheri,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,Andheri,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,Andheri,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,Andheri,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,Andheri,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


**Next, let's group rows by neighborhood and take the mean of the frequency of occurrence of each category**

In [53]:
mi_grouped = mi_onehot.groupby(["Neighborhoods"]).mean().reset_index()

print(mi_grouped.shape)
mi_grouped

(39, 133)


Unnamed: 0,Neighborhoods,ATM,Accessories Store,American Restaurant,Antique Shop,Arcade,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,BBQ Joint,Bakery,Bank,Bar,Beach,Bed & Breakfast,Bengali Restaurant,Bike Rental / Bike Share,Bistro,Boat or Ferry,Bookstore,Boutique,Breakfast Spot,Brewery,Burger Joint,Bus Station,Café,Cheese Shop,Chinese Restaurant,Clothing Store,Cocktail Bar,Coffee Shop,Comfort Food Restaurant,Concert Hall,Convenience Store,Cupcake Shop,Dance Studio,Deli / Bodega,Department Store,Design Studio,Dessert Shop,Diner,Dog Run,Donut Shop,Electronics Store,Farmers Market,Fast Food Restaurant,Field,Flea Market,Flower Shop,Food,Food & Drink Shop,Food Court,Food Truck,Fried Chicken Joint,Furniture / Home Store,Gaming Cafe,Garden,Gastropub,German Restaurant,Gift Shop,Golf Course,Grocery Store,Gym,Gym / Fitness Center,Historic Site,History Museum,Hookah Bar,Hostel,Hot Dog Joint,Hotel,Hotel Pool,Ice Cream Shop,Indian Restaurant,Indie Movie Theater,Intersection,Italian Restaurant,Japanese Restaurant,Jewelry Store,Juice Bar,Lake,Lounge,Luggage Store,Maharashtrian Restaurant,Market,Mediterranean Restaurant,Men's Store,Middle Eastern Restaurant,Monument / Landmark,Movie Theater,Moving Target,Mughlai Restaurant,Multicuisine Indian Restaurant,Music Store,Neighborhood,Nightclub,Noodle House,North Indian Restaurant,Optical Shop,Paper / Office Supplies Store,Park,Parsi Restaurant,Pet Store,Pharmacy,Pier,Pizza Place,Platform,Playground,Plaza,Pub,Punjabi Restaurant,Recording Studio,Restaurant,Roof Deck,Salon / Barbershop,Sandwich Place,Seafood Restaurant,Shoe Store,Shopping Mall,Smoke Shop,Snack Place,Soccer Field,South Indian Restaurant,Spa,Sports Bar,Steakhouse,Sushi Restaurant,Tea Room,Theme Park,Trail,Train Station,Vegetarian / Vegan Restaurant,Women's Store,Yoga Studio
0,Andheri,0.0,0.052632,0.0,0.0,0.0,0.0,0.0,0.052632,0.0,0.052632,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.052632,0.0,0.0,0.0,0.052632,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.052632,0.0,0.0,0.0,0.0,0.0,0.052632,0.0,0.052632,0.0,0.0,0.0,0.052632,0.0,0.052632,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.052632,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.263158,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.052632,0.0,0.052632,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.052632,0.0,0.0
1,Antop Hill,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Bandra,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0625,0.0625,0.0,0.0,0.125,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0625,0.0,0.0,0.0,0.0,0.0625,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0625,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.3125,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0625,0.0,0.0,0.0,0.0,0.0625,0.0,0.0625,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0625,0.0,0.0,0.0
3,Bhandup,0.0,0.0,0.0,0.0,0.0,0.0,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.333333,0.0,0.0,0.0
4,Borivali,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.076923,0.0,0.076923,0.0,0.153846,0.153846,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.076923,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.076923,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.153846,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.076923,0.0,0.0,0.0,0.0,0.0,0.0,0.076923,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.076923,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,Byculla,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Chembur,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.083333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.083333,0.0,0.041667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.041667,0.0,0.0,0.0,0.0,0.0,0.083333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.083333,0.0,0.0,0.0,0.0,0.0,0.041667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.041667,0.083333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.041667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.041667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.041667,0.041667,0.0,0.041667,0.0,0.0,0.0,0.041667,0.0,0.0,0.0,0.0,0.0,0.0
7,Colaba,0.0,0.0,0.0,0.0,0.0,0.0,0.0625,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0625,0.0,0.125,0.0,0.0,0.0625,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0625,0.0,0.0,0.0,0.0625,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0625,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0625,0.0,0.0,0.125,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0625,0.0625,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0625,0.0,0.0,0.0,0.0,0.0,0.0,0.0625,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0625,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,Dadar,0.0,0.04,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.04,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.08,0.0,0.0,0.0,0.0,0.0,0.04,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.12,0.0,0.0,0.04,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.28,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.04,0.0,0.0,0.0,0.0,0.0,0.08,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.04,0.0,0.0,0.0,0.04,0.0,0.0,0.0,0.0,0.0,0.04,0.0,0.0,0.0,0.0,0.0,0.0,0.04,0.0,0.0,0.0,0.0,0.0,0.04,0.04,0.0
9,Dahisar,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.111111,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.111111,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.111111,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.111111,0.111111,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.111111,0.0,0.0,0.111111,0.0,0.0,0.0,0.0,0.0,0.0,0.111111,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.111111,0.0,0.0,0.0
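
Taking the mean of one-hot columns per neighborhood gives, for each category, the share of that neighborhood's venues that fall in it. The same quantity can be sketched without pandas. `category_frequencies` and the sample rows are hypothetical and only illustrate the arithmetic.

```python
from collections import Counter, defaultdict


def category_frequencies(rows):
    """Map (neighborhood, category) pairs to {neighborhood: {category: share}}."""
    counts = defaultdict(Counter)
    for neighborhood, category in rows:
        counts[neighborhood][category] += 1
    return {
        neighborhood: {cat: n / sum(cats.values()) for cat, n in cats.items()}
        for neighborhood, cats in counts.items()
    }


# hypothetical venue rows: four venues in "A", one in "B"
rows = [("A", "Bakery"), ("A", "Indian Restaurant"), ("A", "Indian Restaurant"),
        ("A", "Shopping Mall"), ("B", "Café")]
freq = category_frequencies(rows)
print(freq["A"]["Indian Restaurant"], freq["A"]["Shopping Mall"])
```

Because each row of shares sums to 1, a neighborhood with few returned venues can show a high share from a single venue.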


In [54]:
len(mi_grouped[mi_grouped["Shopping Mall"] > 0])

5

**Create a new DataFrame for Shopping Mall data only**

In [55]:
mi_mall = mi_grouped[["Neighborhoods","Shopping Mall"]]

In [56]:
mi_mall.head(10)

Unnamed: 0,Neighborhoods,Shopping Mall
0,Andheri,0.0
1,Antop Hill,0.0
2,Bandra,0.0
3,Bhandup,0.0
4,Borivali,0.0
5,Byculla,0.0
6,Chembur,0.0
7,Colaba,0.0
8,Dadar,0.04
9,Dahisar,0.0


### 7. Cluster Neighborhoods
Run k-means to cluster the neighborhoods in Mumbai into 3 clusters.

In [57]:
# set number of clusters
kclusters = 3

mi_clustering = mi_mall.drop(columns=["Neighborhoods"])

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(mi_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([0, 0, 0, 0, 0, 0, 0, 0, 2, 0], dtype=int32)

In [58]:
# create a new dataframe that includes the cluster label for each neighborhood
mi_merged = mi_mall.copy()

# add clustering labels
mi_merged["Cluster Labels"] = kmeans.labels_

In [59]:
mi_merged.rename(columns={"Neighborhoods": "Neighborhood"}, inplace=True)
mi_merged.head()

Unnamed: 0,Neighborhood,Shopping Mall,Cluster Labels
0,Andheri,0.0,0
1,Antop Hill,0.0,0
2,Bandra,0.0,0
3,Bhandup,0.0,0
4,Borivali,0.0,0


In [60]:
# join mi_merged with mi_df to add latitude/longitude for each neighborhood
mi_merged = mi_merged.join(mi_df.set_index("Neighborhood"), on="Neighborhood")

print(mi_merged.shape)
mi_merged.head() # check the last columns!

(39, 5)


Unnamed: 0,Neighborhood,Shopping Mall,Cluster Labels,Latitude,Longitude
0,Andheri,0.0,0,19.11847,72.84177
1,Antop Hill,0.0,0,19.02614,72.86645
2,Bandra,0.0,0,19.05437,72.84017
3,Bhandup,0.0,0,19.14556,72.94856
4,Borivali,0.0,0,19.22936,72.85751


In [61]:
# sort the results by Cluster Labels
print(mi_merged.shape)
mi_merged.sort_values(["Cluster Labels"], inplace=True)
mi_merged

(39, 5)


Unnamed: 0,Neighborhood,Shopping Mall,Cluster Labels,Latitude,Longitude
0,Andheri,0.0,0,19.11847,72.84177
37,Vile Parle,0.0,0,19.09624,72.85024
20,Kandivali west,0.0,0,19.20694,72.83494
21,Kanjurmarg,0.0,0,19.13138,72.93568
22,Khar,0.0,0,19.06913,72.8464
23,Kurla,0.0,0,19.06498,72.88069
24,Malad,0.0,0,19.18655,72.84842
18,Kamathipura,0.0,0,18.96172,72.82625
25,Mankhurd,0.0,0,19.04853,72.93222
28,Mulund,0.0,0,19.17183,72.95565


**Finally, let's visualize the resulting clusters**

In [62]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i+x+(i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(mi_merged['Latitude'], mi_merged['Longitude'], mi_merged['Neighborhood'], mi_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' - Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [63]:
# save the map as HTML file
map_clusters.save('map_clusters.html')

### 8. Examine Clusters

#### Cluster 0

In [64]:
mi_merged.loc[mi_merged['Cluster Labels'] == 0]

Unnamed: 0,Neighborhood,Shopping Mall,Cluster Labels,Latitude,Longitude
0,Andheri,0.0,0,19.11847,72.84177
37,Vile Parle,0.0,0,19.09624,72.85024
20,Kandivali west,0.0,0,19.20694,72.83494
21,Kanjurmarg,0.0,0,19.13138,72.93568
22,Khar,0.0,0,19.06913,72.8464
23,Kurla,0.0,0,19.06498,72.88069
24,Malad,0.0,0,19.18655,72.84842
18,Kamathipura,0.0,0,18.96172,72.82625
25,Mankhurd,0.0,0,19.04853,72.93222
28,Mulund,0.0,0,19.17183,72.95565


#### Cluster 1

In [65]:
mi_merged.loc[mi_merged['Cluster Labels'] == 1]

Unnamed: 0,Neighborhood,Shopping Mall,Cluster Labels,Latitude,Longitude
32,Tardeo,0.052632,1,18.97254,72.81478
35,Vidyavihar,0.052632,1,19.02325,72.8439


#### Cluster 2

In [66]:
mi_merged.loc[mi_merged['Cluster Labels'] == 2]

Unnamed: 0,Neighborhood,Shopping Mall,Cluster Labels,Latitude,Longitude
8,Dadar,0.04,2,19.01991,72.84086
30,Powai,0.033333,2,19.1231,72.90942
26,Matunga,0.027027,2,19.0272,72.85589


#### Cluster 3

In [67]:
mi_merged.loc[mi_merged['Cluster Labels'] == 3]

Unnamed: 0,Neighborhood,Shopping Mall,Cluster Labels,Latitude,Longitude


#### Observations:
Only 5 of the 39 neighborhoods have a shopping mall among the venues returned by Foursquare. Cluster 1 (Tardeo, Vidyavihar) has the highest share of shopping malls among its venues, at about 5%, and cluster 2 (Dadar, Powai, Matunga) has a moderate share of roughly 3 to 4%. Cluster 0, which contains the remaining 34 neighborhoods, has none. Four of these five neighborhoods (all but Powai) were geocoded to the south-central part of the city, while most suburban neighborhoods have no shopping malls in the returned data.

Cluster 0 therefore represents the greatest opportunity for new shopping malls, as there is little to no competition from existing malls, and this project recommends that property developers look first at neighborhoods in this cluster. Property developers with unique selling propositions that stand out from the competition can also consider neighborhoods in cluster 2, which have moderate competition. Lastly, property developers are advised to avoid neighborhoods in cluster 1, which already have the highest concentration of shopping malls and are likely to face the most intense competition.
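
One way to rank the clusters is to average the `Shopping Mall` frequency within each cluster. The sketch below is illustrative only: `mean_by_cluster` is a hypothetical helper, the cluster 1 and cluster 2 values are copied from the tables above, and cluster 0 is represented by three of its rows.

```python
from collections import defaultdict


def mean_by_cluster(rows):
    """Average the values of (cluster, value) pairs per cluster label."""
    totals = defaultdict(lambda: [0.0, 0])
    for cluster, value in rows:
        totals[cluster][0] += value
        totals[cluster][1] += 1
    return {cluster: total / n for cluster, (total, n) in totals.items()}


rows = [
    (0, 0.0), (0, 0.0), (0, 0.0),              # Andheri, Vile Parle, Khar
    (1, 0.052632), (1, 0.052632),              # Tardeo, Vidyavihar
    (2, 0.04), (2, 0.033333), (2, 0.027027),   # Dadar, Powai, Matunga
]
means = mean_by_cluster(rows)

# list clusters from least to most existing mall presence
for cluster, mean in sorted(means.items(), key=lambda kv: kv[1]):
    print(cluster, round(mean, 4))
```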