### Table of Contents

   - Introduction
   - Objectives
   - Data
   - Methodology
       - Analyze Chennai Thiruvanmiyur
       - K-mean Cluster Chennai Thiruvanmiyur
       - Analyze Chennai Ramapuram
       - K-mean Cluster Chennai Ramapuram
   - Results
   - Discussion
   - Conclusion

### Introduction

Chennai is the capital of the Indian state of Tamil Nadu. Located on the Coromandel Coast off the Bay of Bengal, it is the biggest cultural, economic and educational centre of south India. It become a center of attention for residential, job employment, tourism, education, shopping and sports activity. Chennai had the third-largest expatriate population in India. Thiruvanmiyur and Ramapuram are identified as two populated zones in Chennai by sorting the dataframe.  

Brief information about both places:

   Thiruvanmiyur is a largely residential neighborhood in the south of Chennai, Tamil Nadu, India. Thiruvanmiyur witnessed a spike in its economy with the construction of Chennai's first dedicated technology office space, the Tidel Information Technology Park in neighboring Taramani. The subsequent rise of several information technology businesses, research centres and offices around Tidel park proved fortuitous for Thiruvanmiyur, as many of the workers at these offices often made Thiruvanmiyur their home.(source: https://en.wikipedia.org/wiki/Thiruvanmiyur)

   Ramapuram is a neighbourhood in the western part of Chennai, India.It is also famous for Arasamaram Temple which was nearly 100 years old and Lakshmi Narasimha Perumal Temple (Lakshmi Narasimhan idol is approximately 2000 years old). The neighbourhood is surrounded by big hospitals like MIOT Hospital and SRM. Large organizations like L&T InfoTech, IBHYA, and DLF IT Park are also located in Ramapuram.(source: https://en.wikipedia.org/wiki/Ramapuram,_Chennai)


### Objective

In this project, we will study in details the area classification using Foursquare data and machine learning segmentation and clustering. The aim of this project is to segment areas of Thiruvanmiyur and Ramapuram based on the most common places captured from Foursquare.

Using segmentation and clustering, we hope we can determine:

    the similarity or dissimilarity of both places
    classification of area located inside the city whether it is residential, tourism places, or others

### Data

The data acquired from Open Government Data (OGD) Platform India - data.gov.in is a platform for supporting Open Data initiative of Government of India.(1) The portal is intended to be used by Government of India Ministries Departments their organizations to publish datasets, documents, services, tools and applications collected by them for public use and Makaan.com is an online real estate portal in India. It has a rating system for brokers which provides prices for Chennai area. (2) These datas are restructured to csv file for easier manipulation and reading (https://www.dropbox.com/s/rnio8fsvbx10qdt/Chennai-P.csv?dl=1).

    (1)https://data.gov.in/resources/primary-census-abstract-chennai-district-tamil-nadu-2001
    (2)https://www.makaan.com/price-trends/property-rates-for-buy-in-chennai

Another aspect to consider for this project is the Foursquare data. I believe that the data as good as provided, meaning although we are using Foursquare data for segmentation and clustering, the amount and accuracy of data captured can't 100% determine correct classification in real world.

To start, let's get and look at the data. So let's read it (https://www.dropbox.com/s/rnio8fsvbx10qdt/Chennai-P.csv?dl=1) and load it to dataframe:


Creating a dataframe for Place 1

In [1]:
#import the required library
import numpy as np
import pandas as pd

#read csv file contain KL data
df_kl = pd.read_csv("https://www.dropbox.com/s/rnio8fsvbx10qdt/Chennai-P.csv?dl=1")
df_kl.head()

Unnamed: 0,VillageName,Corporation,District,Latitude,Longitude,PricePerSqFt,TotalPersons,TotalMales,TotalFemales
0,Kodungaiyur (West),Tondiarpet,Chennai,13.1375,80.2478,3571,57723,29449,28274
1,Kodungaiyur (East),Tondiarpet,Chennai,13.1301,80.2572,3643,50385,25836,24549
2,Tondiarpet,Tondiarpet,Chennai,13.1272,80.29,6700,32373,16511,15862
3,Royapuram,Basin bridge,Chennai,13.1014,80.2704,10300,15719,7931,7788
4,Vyasarpadi (South),Basin bridge,Chennai,13.11111,80.26472,5294,37155,18900,18255


Creating a dataframe for Place 2

In [2]:
#read and load JB data
df_jb = pd.read_csv("https://www.dropbox.com/s/rnio8fsvbx10qdt/Chennai-P.csv?dl=1")
df_jb.head()

Unnamed: 0,VillageName,Corporation,District,Latitude,Longitude,PricePerSqFt,TotalPersons,TotalMales,TotalFemales
0,Kodungaiyur (West),Tondiarpet,Chennai,13.1375,80.2478,3571,57723,29449,28274
1,Kodungaiyur (East),Tondiarpet,Chennai,13.1301,80.2572,3643,50385,25836,24549
2,Tondiarpet,Tondiarpet,Chennai,13.1272,80.29,6700,32373,16511,15862
3,Royapuram,Basin bridge,Chennai,13.1014,80.2704,10300,15719,7931,7788
4,Vyasarpadi (South),Basin bridge,Chennai,13.11111,80.26472,5294,37155,18900,18255


Sorting the data based on population to choose two most populated area in the Chennai city

In [3]:
df_kl[['VillageName', 'Corporation', 'District', 'Latitude', 'Longitude', 'TotalPersons', 'TotalMales', 'TotalFemales']].sort_values('TotalPersons', ascending=False).nlargest(10, 'TotalPersons')

Unnamed: 0,VillageName,Corporation,District,Latitude,Longitude,TotalPersons,TotalMales,TotalFemales
40,Thiruvanmiyur (West),Mylapore,Chennai,12.9794,80.2494,95818,49109,46709
41,Ramapuram,Mylapore,Chennai,13.0317,80.1817,78007,39596,38411
13,Kulathur,Ayanavaram,Chennai,13.12361,80.21278,74363,38016,36347
14,Villivakkam (South),Kilpauk,Chennai,13.1101,80.2061,68502,34726,33776
15,Virugambakkam (North),Kilpauk,Chennai,13.049557,80.184928,68185,34625,33560
16,Anna Nagar (West),Kilpauk,Chennai,13.0862,80.2018,68054,35310,32744
0,Kodungaiyur (West),Tondiarpet,Chennai,13.1375,80.2478,57723,29449,28274
1,Kodungaiyur (East),Tondiarpet,Chennai,13.1301,80.2572,50385,25836,24549
33,Saidapet (west),Saidapet,Chennai,13.0217,80.2168,50264,25935,24329
23,Aminjikarai (west),Kilpauk,Chennai,13.083333,80.233333,46416,24140,22276


Based on the data of Latitude and Longitude for both places, we can now create map with pointed area in it.

In [5]:
!pip install folium
from geopy.geocoders import Nominatim
import folium

address = 'Chennai, India'
geolocator = Nominatim()
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

# create map of New York using latitude and longitude values
map_kl = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(df_kl['Latitude'], df_kl['Longitude'], df_kl['VillageName'], df_kl['Corporation']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_kl)  
    
map_kl

Collecting folium
[?25l  Downloading https://files.pythonhosted.org/packages/72/ff/004bfe344150a064e558cb2aedeaa02ecbf75e60e148a55a9198f0c41765/folium-0.10.0-py2.py3-none-any.whl (91kB)
[K     |████████████████████████████████| 92kB 275kB/s eta 0:00:01
Collecting branca>=0.3.0 (from folium)
  Downloading https://files.pythonhosted.org/packages/63/36/1c93318e9653f4e414a2e0c3b98fc898b4970e939afeedeee6075dd3b703/branca-0.3.1-py3-none-any.whl
Installing collected packages: branca, folium
Successfully installed branca-0.3.1 folium-0.10.0




In [26]:
address = 'Chennai, India'
geolocator = Nominatim()
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

# create map of New York using latitude and longitude values
map_jb = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(df_jb['Latitude'], df_jb['Longitude'], df_jb['VillageName'], df_jb['Corporation']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_jb)  
    
map_jb

  from ipykernel import kernelapp as app


Methodology

In this project, I will use the basic methodology as taught in Week 3 lab.

    Above, we have done convert addresses into their equivalent latitude and longitude values.
    Then we will use the Foursquare API to explore neighborhoods in both places, Thiruvanmiyur and Ramapuram
    After that, explore function to get the most common venue categories in each neighborhood,
    and then use this feature to group the neighborhoods into clusters

K-means clustering algorithm will be use to complete this task. And also, the Folium library to visualize the neighborhoods in Chennai and their emerging clusters.
Based on dataframe analysis above, we found out that Thiruvanmiyur area in Chennai and Ramapuram area in Chennai have the highest number of population.

In [7]:
#slice the original dataframe and create a new dataframe of the Bukit Bintang
bbintang = df_kl[df_kl['VillageName'] == 'Thiruvanmiyur  (West)'].reset_index(drop=True)

#get the geographical coordinates of Bukit Bintang, Kuala Lumpur
address = 'Thiruvanmiyur, Chennai'
geolocator = Nominatim()
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

# create map of Bukit Bintang using latitude and longitude values
map_bintang = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(bbintang['Latitude'], bbintang['Longitude'], bbintang['Corporation']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_bintang)  
    
map_bintang



In [8]:
#slice the original dataframe and create a new dataframe of the Iskandar
jdt = df_jb[df_jb['VillageName'] == 'Ramapuram'].reset_index(drop=True)

#get the geographical coordinates of Manhattan
address = 'Ramapuram, Chennai'
geolocator = Nominatim()
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

# create map of Bukit Bintang using latitude and longitude values
map_jdt = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(jdt['Latitude'], jdt['Longitude'], jdt['Corporation']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_jdt)  
    
map_jdt



Using Foursquare API to get venues at surounding area of both Thiruvanmiyur and Ramapuram.

In [10]:
import requests # library to handle requests
from pandas.io.json import json_normalize # tranform JSON file into a pandas dataframe

#Define Foursquare Credentials and Version
CLIENT_ID = 'S34EAXF4QDSSZSRGKUHWA25K4ANQXEARFSR4ZI3W1EMBYZXW' # your Foursquare ID
CLIENT_SECRET = 'GV3ILPWKD2ETMZOLFMPRA0S3ORTYEQZAYMJA3RM2XN32OWVY' # your Foursquare Secret
VERSION = '20180604'

#explore the first neighborhood in our dataframe
#Get the neighborhood's latitude and longitude values.
neighborhood_latitude = bbintang.loc[0, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = bbintang.loc[0, 'Longitude'] # neighborhood longitude value
neighborhood_name = bbintang.loc[0, 'VillageName'] # neighborhood name

#get the top 100 venues that are in Bukit Bintang within a radius of 500 meters
LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 500 # define radius
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)

#Send the GET request and examine the resutls
results = requests.get(url).json()

#borrow the get_category_type function from the Foursquare lab.
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

#clean the json and structure it into a pandas dataframe
venues = results['response']['groups'][0]['items']    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]
print('{} venues were returned by Foursquare for Thiruvanmiyur, Chennai.'.format(nearby_venues.shape[0]))
nearby_venues.head()

5 venues were returned by Foursquare for Thiruvanmiyur, Chennai.


Unnamed: 0,name,categories,lat,lng
0,Holiday Inn Chennai OMR IT Expressway,Hotel,12.980204,80.252869
1,OMR,Hostel,12.978376,80.252048
2,Chicago Cool Bar,Juice Bar,12.980315,80.252097
3,SRP Tools,Bus Station,12.980251,80.252282
4,Chicago Tea Kadai,Café,12.983136,80.25177


In [53]:
#explore the first neighborhood in our dataframe
#Get the neighborhood's latitude and longitude values.
neighborhood_latitude = jdt.loc[0, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = jdt.loc[0, 'Longitude'] # neighborhood longitude value
neighborhood_name = jdt.loc[0, 'VillageName'] # neighborhood name

#get the top 100 venues that are in Marble Hill within a radius of 500 meters
LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 500 # define radius
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)

#Send the GET request and examine the resutls
results = requests.get(url).json()

#clean the json and structure it into a pandas dataframe
venues = results['response']['groups'][0]['items']    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]
print('{} venues were returned by Foursquare for Ramapuram.'.format(nearby_venues.shape[0]))
nearby_venues.head()

1 venues were returned by Foursquare for Ramapuram.


Unnamed: 0,name,categories,lat,lng
0,Café Coffee Day,Café,13.03174,80.186109


In [11]:
#function to repeat the same process to all area
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['VillageName', 
                  'Area Latitude', 
                  'Area Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

#run the above function on each neighborhood and create a new dataframe
bintang_venues = getNearbyVenues(names=bbintang['VillageName'],
                                   latitudes=bbintang['Latitude'],
                                   longitudes=bbintang['Longitude']
                                  )

#check the size of the resulting dataframe
print(bintang_venues.shape)
bintang_venues.head()

Thiruvanmiyur  (West)
(5, 7)


Unnamed: 0,VillageName,Area Latitude,Area Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Thiruvanmiyur (West),12.9794,80.2494,Holiday Inn Chennai OMR IT Expressway,12.980204,80.252869,Hotel
1,Thiruvanmiyur (West),12.9794,80.2494,OMR,12.978376,80.252048,Hostel
2,Thiruvanmiyur (West),12.9794,80.2494,Chicago Cool Bar,12.980315,80.252097,Juice Bar
3,Thiruvanmiyur (West),12.9794,80.2494,SRP Tools,12.980251,80.252282,Bus Station
4,Thiruvanmiyur (West),12.9794,80.2494,Chicago Tea Kadai,12.983136,80.25177,Café


In [12]:
#run the above function on each neighborhood and create a new dataframe
jdt_venues = getNearbyVenues(names=jdt['VillageName'],
                                   latitudes=jdt['Latitude'],
                                   longitudes=jdt['Longitude']
                                  )

#check the size of the resulting dataframe
print(jdt_venues.shape)
jdt_venues.head()

Ramapuram
(4, 7)


Unnamed: 0,VillageName,Area Latitude,Area Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Ramapuram,13.0317,80.1817,Café Coffee Day,13.03174,80.186109,Café
1,Ramapuram,13.0317,80.1817,Hotel Jordan,13.031587,80.180475,BBQ Joint
2,Ramapuram,13.0317,80.1817,Rama's Cafe,13.03157,80.18254,Indian Restaurant
3,Ramapuram,13.0317,80.1817,Clave Oven,13.031619,80.181461,Bakery


In [60]:
#check how many venues were returned for each area
print('There are {} uniques categories in Thiruvanmiyur.'.format(len(bintang_venues['Venue Category'].unique())))
bintang_venues.groupby('VillageName').count()

There are 5 uniques categories in Thiruvanmiyur.


Unnamed: 0_level_0,Area Latitude,Area Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
VillageName,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Thiruvanmiyur (West),5,5,5,5,5,5


In [61]:
#check how many venues were returned for each area
print('There are {} uniques categories in Ramapuram.'.format(len(jdt_venues['Venue Category'].unique())))
jdt_venues.groupby('VillageName').count()

There are 1 uniques categories in Ramapuram.


Unnamed: 0_level_0,Area Latitude,Area Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
VillageName,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Ramapuram,1,1,1,1,1,1


### Analyze Thiruvanmiyur

In [14]:
# one hot encoding
bintang_onehot = pd.get_dummies(bintang_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
bintang_onehot['VillageName'] = bintang_venues['VillageName'] 

# move neighborhood column to the first column
fixed_columns = [bintang_onehot.columns[-1]] + list(bintang_onehot.columns[:-1])
bintang_onehot = bintang_onehot[fixed_columns]

#examine the new dataframe size after one hot encoding
print('{} rows were returned after one hot encoding.'.format(bintang_onehot.shape[0]))

#group rows by neighborhood and by taking the mean of the frequency of occurrence of each category
bintang_grouped = bintang_onehot.groupby('VillageName').mean().reset_index()

#examine the new dataframe size after one hot encoding
print('{} rows were returned after grouping.'.format(bintang_grouped.shape[0]))

5 rows were returned after one hot encoding.
1 rows were returned after grouping.


In [15]:
#print each neighborhood along with the top 5 most common venues
num_top_venues = 5

for hood in bintang_grouped['VillageName']:
    print("----"+hood+"----")
    temp = bintang_grouped[bintang_grouped['VillageName'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Thiruvanmiyur  (West)----
         venue  freq
0  Bus Station   0.2
1         Café   0.2
2       Hostel   0.2
3        Hotel   0.2
4    Juice Bar   0.2




In [18]:
#put into a pandas dataframe

#write a function to sort the venues in descending order
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

#create the new dataframe and display the top 10 venues for each neighborhood
num_top_venues = 1

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['VillageName']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
areas_venues_sorted = pd.DataFrame(columns=columns)
areas_venues_sorted['VillageName'] = bintang_grouped['VillageName']

for ind in np.arange(bintang_grouped.shape[0]):
    areas_venues_sorted.iloc[ind, 1:] = return_most_common_venues(bintang_grouped.iloc[ind, :], num_top_venues)

areas_venues_sorted.head()

Unnamed: 0,VillageName,1st Most Common Venue
0,Thiruvanmiyur (West),Juice Bar


In [19]:
from sklearn.cluster import KMeans

# set number of clusters
kclusters = 1

bintang_grouped_clustering = bintang_grouped.drop('VillageName', 1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(bintang_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:1] 

#create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.
bintang_merged = bbintang

# add clustering labels
bintang_merged['Cluster Labels'] = kmeans.labels_

# merge toronto_grouped with toronto_data to add latitude/longitude for each neighborhood
bintang_merged = bintang_merged.join(areas_venues_sorted.set_index('VillageName'), on='VillageName')

bintang_merged.head()

Unnamed: 0,VillageName,Corporation,District,Latitude,Longitude,PricePerSqFt,TotalPersons,TotalMales,TotalFemales,Cluster Labels,1st Most Common Venue
0,Thiruvanmiyur (West),Mylapore,Chennai,12.9794,80.2494,16250,95818,49109,46709,0,Juice Bar


In [20]:
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

#Finally, let's visualize the resulting clusters
# create map 3.1343385, 101.6863371
bb_clusters = folium.Map(location=[12.9794, 80.2494], zoom_start=13)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i+x+(i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(bintang_merged['Latitude'], bintang_merged['Longitude'], bintang_merged['VillageName'], bintang_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(bb_clusters)
       
bb_clusters

### Analyze Ramapuram

In [21]:
# one hot encoding
jdt_onehot = pd.get_dummies(jdt_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
jdt_onehot['VillageName'] = jdt_venues['VillageName'] 

# move neighborhood column to the first column
fixed_columns = [jdt_onehot.columns[-1]] + list(jdt_onehot.columns[:-1])
jdt_onehot = jdt_onehot[fixed_columns]

#examine the new dataframe size after one hot encoding
print('{} rows were returned after one hot encoding.'.format(jdt_onehot.shape[0]))

#group rows by neighborhood and by taking the mean of the frequency of occurrence of each category
jdt_grouped = jdt_onehot.groupby('VillageName').mean().reset_index()

#examine the new dataframe size after one hot encoding
print('{} rows were returned after grouping.'.format(jdt_grouped.shape[0]))

4 rows were returned after one hot encoding.
1 rows were returned after grouping.


In [22]:
#print each neighborhood along with the top 5 most common venues
num_top_venues = 5

for hood in jdt_grouped['VillageName']:
    print("----"+hood+"----")
    temp = jdt_grouped[jdt_grouped['VillageName'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Ramapuram----
               venue  freq
0          BBQ Joint  0.25
1             Bakery  0.25
2               Café  0.25
3  Indian Restaurant  0.25




In [23]:
#create the new dataframe and display the top 10 venues for each neighborhood
num_top_venues = 1

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['VillageName']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
areas_venues_sorted = pd.DataFrame(columns=columns)
areas_venues_sorted['VillageName'] = jdt_grouped['VillageName']

for ind in np.arange(jdt_grouped.shape[0]):
    areas_venues_sorted.iloc[ind, 1:] = return_most_common_venues(jdt_grouped.iloc[ind, :], num_top_venues)

areas_venues_sorted.head()

Unnamed: 0,VillageName,1st Most Common Venue
0,Ramapuram,Indian Restaurant


### K-mean Cluster Ramapuram

In [24]:
# set number of clusters
kclusters = 1

jdt_grouped_clustering = jdt_grouped.drop('VillageName', 1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(jdt_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

#create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.
jdt_merged = jdt

# add clustering labels
jdt_merged['Cluster Labels'] = kmeans.labels_

# merge toronto_grouped with toronto_data to add latitude/longitude for each neighborhood
jdt_merged = jdt_merged.join(areas_venues_sorted.set_index('VillageName'), on='VillageName')

jdt_merged.head() # check the last columns!

Unnamed: 0,VillageName,Corporation,District,Latitude,Longitude,PricePerSqFt,TotalPersons,TotalMales,TotalFemales,Cluster Labels,1st Most Common Venue
0,Ramapuram,Mylapore,Chennai,13.0317,80.1817,9126,78007,39596,38411,0,Indian Restaurant


In [25]:
#Finally, let's visualize the resulting clusters
# create map
jdt_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i+x+(i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(jdt_merged['Latitude'], jdt_merged['Longitude'], jdt_merged['VillageName'], jdt_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(jdt_clusters)
       
jdt_clusters

### Results

In [77]:
#Cluster 1 for Thiruvanmiyur
bintang_merged.loc[bintang_merged['Cluster Labels'] == 0, bintang_merged.columns[[2] + list(range(5, bintang_merged.shape[1]))]]

Unnamed: 0,District,PricePerSqFt,TotalPersons,TotalMales,TotalFemales,Cluster Labels,1st Most Common Venue
0,Chennai,16250,95818,49109,46709,0,Juice Bar


In [78]:
#Cluster 2 for Thiruvanmiyur
bintang_merged.loc[bintang_merged['Cluster Labels'] == 1, bintang_merged.columns[[2] + list(range(5, bintang_merged.shape[1]))]]

Unnamed: 0,District,PricePerSqFt,TotalPersons,TotalMales,TotalFemales,Cluster Labels,1st Most Common Venue


In [79]:
#Cluster 1 for Ramapuram
jdt_merged.loc[jdt_merged['Cluster Labels'] == 0, jdt_merged.columns[[2] + list(range(5, jdt_merged.shape[1]))]]

Unnamed: 0,District,PricePerSqFt,TotalPersons,TotalMales,TotalFemales,Cluster Labels,1st Most Common Venue
0,Chennai,9126,78007,39596,38411,0,Café


In [80]:
#Cluster 2 for Ramapuram
jdt_merged.loc[jdt_merged['Cluster Labels'] == 1, jdt_merged.columns[[2] + list(range(5, jdt_merged.shape[1]))]]

Unnamed: 0,District,PricePerSqFt,TotalPersons,TotalMales,TotalFemales,Cluster Labels,1st Most Common Venue


### Discussion

Based on cluster for each places above, we believe that classification for each cluster can be done better with calculation of venues categories (most common) in each cities. Refering to each clsuter, we can't deterimine clearly what represent in each cluster by using Foursquare - Most Common Venue data.

However, for the sae of this project we assumed each cluster as follow:

    Cluster 1: Thiruvanmiyur: Hotel
    Cluster 2: Thiruvanmiyur: Hostel
    Cluster 3: Thiruvanmiyur: Juice Bar
    Cluster 1: Ramapuram: Cafe coffee Bar
    
What is lacking at this point is a systematic, quantitative way to identify and distinguish different district and to describe the correlation most common venues as recorded in Foursquare. The reality is however more complex: similar cities might have or might not have similar common venues. A further step in this classification would be to find a method to extract these common venues and integrate the spatial correlations between different of areas or district.

We believe that the classification we propose is an encouraging step towards a quantitative and systematic comparison of the different places. Further studies are indeed needed in order to relate the data acquired, then observe it to more meaningful and objective results.

### Conclusion

Using Foursquare API, we can capture data of common places all around the world. Using it, we refer back to our main objectives, which is to determine;

    the similarity or dissimilarity of both places
    classification of area located inside the city whether it is hotel, hostel, juice bar, coffee bar or others

In conclusion, both places Thiruvanmiyur and Ramapuram are the center of attraction among Chennaites. However, to declare both places are similar or dissimilar base on common venues visited is quite difficult. Both places is similar in some venues also dissimilar in certain venues. And for classitification based on common venues, again we must have more systematic or quantitative way to identify and declare this. Comparison can be made, but no such method or quantitative data to determine this. We hope in the future, a method to determine it can be establish and explore for references.

Thank you.

by
Bency Sherin G V
bencysherin@gmail.com