# Final Project - THE BATTLE OF NEIGHBOURHOOD
## Applied Data Science Capstone by IBM/Coursera

## Introduction: Business Problem

This project aims to select or to suggest the best place in <b>India</b> for the targeted people to get settled. 

The report will be targeted to people who are looking to <b>settle in India.</b>

For settlement, or to find a neighbourhood to hunt for an apartment, we will be focusing on the <b>four metropolitan cities</b> only - namely, <b>Delhi, Mumbai, Chennai and Kolkata.</b>

We will explore the neighbourhood of all four cities and finally cluster the neighbourhoods using k-mean clustering.

We will suggest different cities on basis of their preferences and the atmosphere.

# Data

Based on the definition of the problem,factors that will influence our decisons is "Most common places in Neighbourhood of the cities"

Following data sources will be used to extract/generate the information:-

Part1:Preprocessing a data set to get pin codes and location information from https://data.gov.in/resources/all-india-pincode-directory-contact-details-along-latitude-and-longitude 

Part2: Getting coordinates of some places that are not included in dataset using Google Maps API Geocoding

Part3: Using Foursquare API to get neighbourhood data.

### Importing Required Libraries

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analsysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('mode.chained_assignment', None)

import json # library to handle JSON files

import re#Library to find particular pattern
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # tranform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

import time #To create delays
# import k-means from clustering stage
from sklearn.cluster import KMeans

import folium # map rendering library

print('Libraries imported.')

Libraries imported.


## Preprocessing the dataset

Loading Data in pandas dataframe form downloaded file

Dataset Location:-https://data.gov.in/resources/all-india-pincode-directory-contact-details-along-latitude-and-longitude 

This is the dataset prepared by Indian Government that contains pincode directory of all areas in India

### Reading the dataset

In [2]:
# Reading in the Data
main_df=pd.read_csv("Desktop/all_india_PO_list_without_APS_offices_ver2_lat_long.csv")

In [3]:
# View top 5 rows of the dataset
main_df.head()

Unnamed: 0,officename,pincode,officeType,Deliverystatus,divisionname,regionname,circlename,Taluk,Districtname,statename,Telephone,Related Suboffice,Related Headoffice,longitude,latitude
0,Achalapur B.O,504273,B.O,Delivery,Adilabad,Hyderabad,Andhra Pradesh,Asifabad,Adilabad,TELANGANA,,Rechini S.O,Mancherial H.O,,
1,Ada B.O,504293,B.O,Delivery,Adilabad,Hyderabad,Andhra Pradesh,Asifabad,Adilabad,TELANGANA,,Asifabad S.O,Mancherial H.O,,
2,Adegaon B.O,504307,B.O,Delivery,Adilabad,Hyderabad,Andhra Pradesh,Boath,Adilabad,TELANGANA,,Echoda S.O,Adilabad H.O,,
3,Adilabad Collectorate S.O,504001,S.O,Non-Delivery,Adilabad,Hyderabad,Andhra Pradesh,Adilabad,Adilabad,TELANGANA,08732-226703,,Adilabad H.O,,
4,Adilabad H.O,504001,H.O,Delivery,Adilabad,Hyderabad,Andhra Pradesh,Adilabad,Adilabad,TELANGANA,08732-226738,,,,


### Removing unnecessary Columns

In [4]:
# A list of required columns
required_columns=['pincode','Taluk','longitude','latitude']

In [5]:
# Making a new dataframe from old one extracting the required dataframes only
df_trun=main_df[required_columns]

In [6]:
# View top 5 rows of the dataframe
df_trun.head()

Unnamed: 0,pincode,Taluk,longitude,latitude
0,504273,Asifabad,,
1,504293,Asifabad,,
2,504307,Boath,,
3,504001,Adilabad,,
4,504001,Adilabad,,


In [7]:
# Checking No. of entries in the dataframe
df_trun.shape

(154797, 4)

### Removing unnecessary entries

Here the dataset contains approx 1.5 lakh rows but we want the data of metropolitan cities only.

In [8]:
#Making a new data frame containing all relevant entries
df_metro=df_trun[df_trun.Taluk.str.contains('Kolkata',na=False) | df_trun.Taluk.str.contains('Delhi',na=False) \
                 | df_trun.Taluk.str.contains('Chennai',na=False) | df_trun.Taluk.str.contains('Mumbai',na=False)]

In [9]:
# Viewing first five rows of dataset
df_metro.head()

Unnamed: 0,pincode,Taluk,longitude,latitude
32386,110090,East Delhi,,
32398,110053,East Delhi,,
32439,110094,Delhi North East,,
32459,110006,Delhi,,
32460,110033,Delhi,,


In [10]:
#Checking Shape of Dataset
df_metro.shape

(800, 4)

In [11]:
# Checking for missing values of Pincode
df_metro.isnull()['pincode'].value_counts()

False    800
Name: pincode, dtype: int64

In [12]:
# Verifying if the pincode is not repeated (They wll be required to fill the missing data)
len(df_metro['pincode'].unique())

234

In [13]:
#Dropping Duplicate Entries
df_metro.drop_duplicates(subset ="pincode",inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


In [14]:
#Resetting the index
df_metro.reset_index(drop=True,inplace=True)
#Viewing first 10 rows of dataframe
df_metro.head(10)

Unnamed: 0,pincode,Taluk,longitude,latitude
0,110090,East Delhi,,
1,110053,East Delhi,,
2,110094,Delhi North East,,
3,110006,Delhi,,
4,110033,Delhi,,
5,110036,Delhi,,
6,110034,Delhi,,
7,110052,Delhi,,
8,110039,Delhi,,
9,110042,Delhi,,


In [15]:
df_metro.shape

(234, 4)

### Filling missing values

In [16]:
# Checking for missing values of Coordinates
df_metro.isnull()['longitude'].value_counts()

True    234
Name: longitude, dtype: int64

Here we don't have Coordinates for any of the entry.So we have to fetch them using Google Geocoding API.

#### Using Geopy to get the location coordinates

In [17]:
geolocator = Nominatim(user_agent="kb636an@gmail.com")
def get_coords(pincode):
    location = geolocator.geocode(pincode)
    latitude = location.latitude
    longitude = location.longitude
    address = location.address
    name = re.findall('^.+Tehsil',location.address)
    if len(name)==0:
        name=address
    else:
        name=name[0]
    return name,latitude,longitude

#### Filling the dataframe with new values

In [18]:
#Function to get required data
for i in range(len(df_metro)):
    pincode=df_metro.loc[i]['pincode']
    try:
        data=get_coords(pincode)
        df_metro.loc[i,'Neighborhood']=data[0]
        df_metro.loc[i,'latitude']=data[1]
        df_metro.loc[i,'longitude']=data[2]
    except:
        pass

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[key] = _infer_fill_value(value)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[item] = s


In [19]:
#Viewing top 5 rows of the dataframe
df_metro.head()

Unnamed: 0,pincode,Taluk,longitude,latitude,Neighbourhood
0,110090,East Delhi,-34.926281,-8.01663,"110090, Rua Nova Descoberta, Bolo de Noiva, No..."
1,110053,East Delhi,77.27982,28.691913,"Babarpur, Shahdara Tehsil"
2,110094,Delhi North East,77.272664,28.716206,Yamuna Vihar Tehsil
3,110006,Delhi,77.231623,28.655984,"Chandni Chowk, Old Delhi, Delhi, Kotwali Tehsil"
4,110033,Delhi,77.168477,28.727506,"Jahangirpuri Colony, Model Town Tehsil"


In [20]:
#Saving the data to csv
df_metro.to_csv("India_Metropolitan.csv")

Here we are seaching for coordinates in India but some entries contain garbage values.

We will drop all the rows that contain inappropriate entries

#### Dropping entries with inappropriate entries

In [23]:
df_metro

Unnamed: 0,pincode,Taluk,longitude,latitude,Neighbourhood
0,110090,East Delhi,-34.926281,-8.01663,"110090, Rua Nova Descoberta, Bolo de Noiva, No..."
1,110053,East Delhi,77.27982,28.691913,"Babarpur, Shahdara Tehsil"
2,110094,Delhi North East,77.272664,28.716206,Yamuna Vihar Tehsil
3,110006,Delhi,77.231623,28.655984,"Chandni Chowk, Old Delhi, Delhi, Kotwali Tehsil"
4,110033,Delhi,77.168477,28.727506,"Jahangirpuri Colony, Model Town Tehsil"
5,110036,Delhi,77.168828,28.817383,Alipur Tehsil
6,110034,Delhi,77.136121,28.69456,Saraswati Vihar Tehsil
7,110052,Delhi,77.175292,28.683588,"Ashok Vihar - IV, Saraswati Vihar Tehsil"
8,110039,Delhi,77.04211,28.797887,Narela Tehsil
9,110042,Delhi,77.107341,28.744966,"Sector 17, Rohini, Alipur Tehsil"


In [24]:
#The values that needs to be corrected are:-
drop_rows=[0,13,17,25,27,29,30,33,39,40,56,61,77,78,79,80,83,84,88,90,91,94,95,96,103,106,109,112,113,116,120,148,160,163,168,171,172,180,193,203,209,213,218,221]

In [28]:
#Dropping the garbage entries
df_metro.drop(df_metro.index[drop_rows],inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  errors=errors,


In [30]:
#Resetting the index
df_metro.reset_index(drop=True,inplace=True)

In [31]:
# Viewing the first five rows of data frame
df_metro.head()

Unnamed: 0,pincode,Taluk,longitude,latitude,Neighbourhood
0,110053,East Delhi,77.27982,28.691913,"Babarpur, Shahdara Tehsil"
1,110094,Delhi North East,77.272664,28.716206,Yamuna Vihar Tehsil
2,110006,Delhi,77.231623,28.655984,"Chandni Chowk, Old Delhi, Delhi, Kotwali Tehsil"
3,110033,Delhi,77.168477,28.727506,"Jahangirpuri Colony, Model Town Tehsil"
4,110036,Delhi,77.168828,28.817383,Alipur Tehsil


## Visualising the map

We will be using this function to create the map

In [48]:
def mapcreator(latitude,longitude,zoom,df):
    map_=folium.Map(location=[latitude,longitude],zoom_start=zoom)
    for lat, lng, location, neighborhood in zip(df['latitude'], df['longitude'], df['Taluk'], df['Neighborhood']):
        label = '{}, {}'.format(neighborhood, location)
        label = folium.Popup(label, parse_html=True)
        folium.CircleMarker(
            [lat, lng],
            radius=5,
            popup=label,
            color='blue',
            fill=True,
            fill_color='#3186cc',
            fill_opacity=0.7,
            parse_html=False).add_to(map_)
    return map_

In [49]:
#Finding Coordinates of India
_,latitude,longitude=get_coords('India')

In [50]:
map_india =mapcreator(latitude,longitude,5,df_metro)
map_india

## Modelling the Data

### Now,we will distribute the data in four different groups(on basis of the city they belong to) and perform the similar operations on all of them.

#### For Delhi

In [51]:
#Making a new dataframe containing areas of delhi only
df_delhi=df_metro[df_metro.Taluk.str.contains('Delhi',na=False)]

In [55]:
#Dropping pincode column as no longer needed
df_delhi.drop(['pincode'],axis=1,inplace=True)
#Viewing first 5 rows
df_delhi.head()

Unnamed: 0,Taluk,longitude,latitude,Neighbourhood
0,East Delhi,77.27982,28.691913,"Babarpur, Shahdara Tehsil"
1,Delhi North East,77.272664,28.716206,Yamuna Vihar Tehsil
2,Delhi,77.231623,28.655984,"Chandni Chowk, Old Delhi, Delhi, Kotwali Tehsil"
3,Delhi,77.168477,28.727506,"Jahangirpuri Colony, Model Town Tehsil"
4,Delhi,77.168828,28.817383,Alipur Tehsil


In [56]:
#Checking No. of Entries
df_delhi.shape

(58, 4)

Let's get coordinates of delhi

In [58]:
#Finding Coordinates of Delhi
_,latitude,longitude=get_coords('Delhi,India')

Creating a map of Delhi

In [65]:
map_delhi =mapcreator(latitude,longitude,11,df_delhi)
map_delhi

#### For Mumbai

In [68]:
#Making a new dataframe containing areas of delhi only
df_mumbai=df_metro[df_metro.Taluk.str.contains('Mumbai',na=False)]

In [69]:
#Dropping pincode column as no longer needed
df_mumbai.drop(['pincode'],axis=1,inplace=True)
#Viewing first 5 rows
df_mumbai.head()

Unnamed: 0,Taluk,longitude,latitude,Neighbourhood
58,Mumbai,72.867622,19.023074,"F/N Ward, Zone 2, Mumbai, Mumbai City, Maharas..."
59,Mumbai,72.842493,18.996311,"F/S Ward, Zone 2, Mumbai, Mumbai City, Maharas..."
60,Mumbai,72.840388,18.98178,"E Ward, Zone 1, Mumbai, Mumbai City, Maharasht..."
61,Mumbai,72.846936,19.010619,"F/S Ward, Zone 2, Mumbai, Mumbai City, Maharas..."
62,Mumbai,72.837844,18.968523,"E Ward, Zone 1, Mumbai, Mumbai City, Maharasht..."


In [71]:
#Checking No. of Entries
df_mumbai.shape

(35, 4)

Let's get coordinates of mumbai

In [73]:
#Finding Coordinates of Mumbai
_,latitude,longitude=get_coords('Mumbai,India')

Creating a map of Mumbai

In [74]:
map_mumbai =mapcreator(latitude,longitude,11,df_mumbai)
map_mumbai

#### For Chennai

In [76]:
#Making a new dataframe containing areas of delhi only
df_chennai=df_metro[df_metro.Taluk.str.contains('Chennai',na=False)]

In [77]:
#Dropping pincode column as no longer needed
df_chennai.drop(['pincode'],axis=1,inplace=True)
#Viewing first 5 rows
df_chennai.head()

Unnamed: 0,Taluk,longitude,latitude,Neighbourhood
93,Chennai,80.267927,13.071184,"Ward 63, Zone 5 Royapuram, சென்னை - Chennai, C..."
94,Chennai,80.262137,13.052888,"Ward 118, Zone 9 Teynampet, சென்னை - Chennai, ..."
95,Chennai,80.22585,13.064678,"Ward 109, Zone 9 Teynampet, சென்னை - Chennai, ..."
96,Chennai,80.256225,13.048387,"Ward 111, Zone 9 Teynampet, சென்னை - Chennai, ..."
97,Chennai,80.258069,13.058448,"Ward 111, Zone 9 Teynampet, சென்னை - Chennai, ..."


In [79]:
#Checking No. of Entries
df_chennai.shape

(32, 4)

Let's get coordinates of chennai

In [80]:
#Finding Coordinates of Delhi
_,latitude,longitude=get_coords('Chennai,India')

Creating a map of Chennai

In [81]:
map_chennai =mapcreator(latitude,longitude,11,df_chennai)
map_chennai

#### For Kolkata

In [82]:
#Making a new dataframe containing areas of delhi only
df_kolkata=df_metro[df_metro.Taluk.str.contains('Kolkata',na=False)]

In [83]:
#Dropping pincode column as no longer needed
df_kolkata.drop(['pincode'],axis=1,inplace=True)
#Viewing first 5 rows
df_kolkata.head()

Unnamed: 0,Taluk,longitude,latitude,Neighbourhood
125,Kolkata,88.329496,22.530436,"Kolkata, West Bengal, 700027, India"
126,Kolkata,88.358539,22.58327,"Kolkata, West Bengal, 700007, India"
127,Kolkata,88.348456,22.537944,"Kolkata, West Bengal, 700020, India"
128,Kolkata,88.379909,22.472658,"Sonarpur, South 24 Parganas, West Bengal, 7000..."
129,Kolkata,88.329553,22.551956,"Kolkata, West Bengal, 700022, India"


In [84]:
#Checking No. of Entries
df_kolkata.shape

(66, 4)

Let's get coordinates of kolkata

In [85]:
#Finding Coordinates of Kolkata
_,latitude,longitude=get_coords('Kolkata,India')

Creating a map of Kolkata

In [86]:
map_kolkata =mapcreator(latitude,longitude,11,df_kolkata)
map_kolkata

<b>Defining FourSquare Credentials</b>

In [88]:
CLIENT_ID = 'XQIP25QCLDT0ZQ5N1UGMXKRWU3WYWLNQF34MBU4HCKVWFTQJ' # your Foursquare ID
CLIENT_SECRET = 'PIVB4SW3YZNVO50EUTCV4GJYSNIKX5GXRWNB24IR2BWQDQY0' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentails:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentails:
CLIENT_ID: XQIP25QCLDT0ZQ5N1UGMXKRWU3WYWLNQF34MBU4HCKVWFTQJ
CLIENT_SECRET:PIVB4SW3YZNVO50EUTCV4GJYSNIKX5GXRWNB24IR2BWQDQY0


<b> Let's explore the first Neighborhood in Delhi</b>

Get Neighborhood's Name

In [90]:
df_delhi.loc[0, 'Neighborhood']

'Babarpur, Shahdara Tehsil'

In [93]:
neighborhood_latitude = df_delhi.loc[0, 'latitude'] # neighborhood latitude value
neighborhood_longitude = df_delhi.loc[0, 'longitude'] # neighborhood longitude value

neighborhood_name = df_delhi.loc[0, 'Neighborhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of Babarpur, Shahdara Tehsil are 28.6919134, 77.2798204134402.


<b>No, let's get top 125 venues that are in Babarpur,Shahdra with radius of 500 meters</b>

In [94]:
# type your answer here
# type your answer here
radius=500
LIMIT=125
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)

Send the GET request and examine the results

In [95]:
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5ed2f18171c428001b6d8760'},
  'headerLocation': 'Delhi',
  'headerFullLocation': 'Delhi',
  'headerLocationGranularity': 'city',
  'totalResults': 2,
  'suggestedBounds': {'ne': {'lat': 28.696413404500007,
    'lng': 77.28494071797424},
   'sw': {'lat': 28.687413395499995, 'lng': 77.27470010890617}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '5beca792b23dfa002b29932e',
       'name': 'Royal Enfield Service Center',
       'location': {'address': 'Plot No 201/6',
        'crossStreet': 'Main Road Number 66',
        'lat': 28.69078,
        'lng': 77.2786,
        'labeledLatLngs': [{'label': 'display',
          'lat': 28.69078,
          'lng': 77.2786}],
        'distance': 173,
        'postalCode': '110053',
        'cc': 'IN

In [96]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [97]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Royal Enfield Service Center,Motorcycle Shop,28.69078,77.2786
1,yamuna vihar,Park,28.689816,77.283876


## Explore Neighborhoods

#### Let's create a function to repeat the same process to all the neighborhoods in that city

In [98]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
        try:
            # make the GET request
            results = requests.get(url).json()["response"]['groups'][0]['items']
        except:
            continue
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

#### Venues in Delhi

In [None]:
delhi_venues = getNearbyVenues(names=df_delhi['Neighborhood'],
                                   latitudes=df_delhi['latitude'],
                                   longitudes=df_delhi['longitude']
                                  )

Babarpur, Shahdara Tehsil
Yamuna Vihar Tehsil
Chandni Chowk, Old Delhi, Delhi, Kotwali Tehsil
Jahangirpuri Colony, Model Town Tehsil
Alipur Tehsil
Saraswati Vihar Tehsil
Ashok Vihar - IV, Saraswati Vihar Tehsil
Narela Tehsil
Sector 17, Rohini, Alipur Tehsil
Banker, Narela Tehsil
Rohini Tehsil
Civil Lines Tehsil
Kanjhawalan Tehsil
Civil Lines, Lucknow, Sadar, Lucknow, Uttar Pradesh, 110054, India
Dhaka, Model Town Tehsil
Model Town Tehsil


In [None]:
Cheking Shape and the resulting dataframe

In [None]:
print(delhi_venues.shape)
delhi_venues.head()

Checking count of venues for each neighborhood

In [None]:
delhi_venues.groupby('Neighborhood').count()

##### Let's find out how many unique categories can be curated from all the returned venues

In [None]:
print('There are {} uniques categories.'.format(len(delhi_venues['Venue Category'].unique())))

#### Venues in Mumbai

In [None]:
mumbai_venues = getNearbyVenues(names=df_mumbai['Neighborhood'],
                                   latitudes=df_mumbai['latitude'],
                                   longitudes=df_mumbai['longitude']
                                  )

Babarpur, Shahdara Tehsil
Yamuna Vihar Tehsil
Chandni Chowk, Old Delhi, Delhi, Kotwali Tehsil
Jahangirpuri Colony, Model Town Tehsil
Alipur Tehsil
Saraswati Vihar Tehsil
Ashok Vihar - IV, Saraswati Vihar Tehsil
Narela Tehsil
Sector 17, Rohini, Alipur Tehsil
Banker, Narela Tehsil
Rohini Tehsil
Civil Lines Tehsil
Kanjhawalan Tehsil
Civil Lines, Lucknow, Sadar, Lucknow, Uttar Pradesh, 110054, India
Dhaka, Model Town Tehsil
Model Town Tehsil


In [None]:
Cheking Shape and the resulting dataframe

In [None]:
print(mumbai_venues.shape)
mumbai_venues.head()

Checking count of venues for each neighborhood

In [None]:
mumbai_venues.groupby('Neighborhood').count()

##### Let's find out how many unique categories can be curated from all the returned venues

In [None]:
print('There are {} uniques categories.'.format(len(mumbai_venues['Venue Category'].unique())))

#### Venues in Chennai

In [None]:
chennai_venues = getNearbyVenues(names=df_chennai['Neighborhood'],
                                   latitudes=df_chennai['latitude'],
                                   longitudes=df_chennai['longitude']
                                  )

Babarpur, Shahdara Tehsil
Yamuna Vihar Tehsil
Chandni Chowk, Old Delhi, Delhi, Kotwali Tehsil
Jahangirpuri Colony, Model Town Tehsil
Alipur Tehsil
Saraswati Vihar Tehsil
Ashok Vihar - IV, Saraswati Vihar Tehsil
Narela Tehsil
Sector 17, Rohini, Alipur Tehsil
Banker, Narela Tehsil
Rohini Tehsil
Civil Lines Tehsil
Kanjhawalan Tehsil
Civil Lines, Lucknow, Sadar, Lucknow, Uttar Pradesh, 110054, India
Dhaka, Model Town Tehsil
Model Town Tehsil


In [None]:
Cheking Shape and the resulting dataframe

In [None]:
print(chennai_venues.shape)
chennai_venues.head()

Checking count of venues for each neighborhood

In [None]:
chennai_venues.groupby('Neighborhood').count()

##### Let's find out how many unique categories can be curated from all the returned venues

In [None]:
print('There are {} uniques categories.'.format(len(chennai_venues['Venue Category'].unique())))

#### Venues in Kolkata

In [None]:
kolkata_venues = getNearbyVenues(names=df_kolkata['Neighborhood'],
                                   latitudes=df_kolkata['latitude'],
                                   longitudes=df_kolkata['longitude']
                                  )

Babarpur, Shahdara Tehsil
Yamuna Vihar Tehsil
Chandni Chowk, Old Delhi, Delhi, Kotwali Tehsil
Jahangirpuri Colony, Model Town Tehsil
Alipur Tehsil
Saraswati Vihar Tehsil
Ashok Vihar - IV, Saraswati Vihar Tehsil
Narela Tehsil
Sector 17, Rohini, Alipur Tehsil
Banker, Narela Tehsil
Rohini Tehsil
Civil Lines Tehsil
Kanjhawalan Tehsil
Civil Lines, Lucknow, Sadar, Lucknow, Uttar Pradesh, 110054, India
Dhaka, Model Town Tehsil
Model Town Tehsil


In [None]:
Cheking Shape and the resulting dataframe

In [None]:
print(kolkata_venues.shape)
kolkata_venues.head()

In [None]:
Checking count of venues for each neighborhood

In [None]:
kolkata_venues.groupby('Neighborhood').count()

##### Let's find out how many unique categories can be curated from all the returned venues

In [None]:
print('There are {} uniques categories.'.format(len(kolkata_venues['Venue Category'].unique())))

### Analysing Neighborhoods in Delhi

In [None]:
# one hot encoding
delhi_onehot = pd.get_dummies(delhi_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
delhi_onehot['Neighborhood'] = delhi_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [delhi_onehot.columns[-1]] + list(delhi_onehot.columns[:-1])
delhi_onehot = delhi_onehot[fixed_columns]

delhi_onehot.head()

In [None]:
# Examining Shape of new dataframe
delhi_onehot.shape

#### Next, let's group rows by neighborhood and by taking the mean of the frequency of occurrence of each category

In [None]:
delhi_grouped = delhi_onehot.groupby('Neighborhood').mean().reset_index()
delhi_grouped

In [None]:
#Function to sort venues in descending order
#We will use this function in futyre too
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

Let's create the new dataframe and display the top 10 venues for each neighborhood.

In [None]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_delhi = pd.DataFrame(columns=columns)
neighborhoods_venues_delhi['Neighborhood'] = delhi_grouped['Neighborhood']

for ind in np.arange(delhi_grouped.shape[0]):
    neighborhoods_venues_delhi.iloc[ind, 1:] = return_most_common_venues(delhi_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_delhi.head()

### Analysing Neighborhoods in Mumbai

In [None]:
# one hot encoding
mumbai_onehot = pd.get_dummies(mumbai_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
mumbai_onehot['Neighborhood'] = mumbai_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [mumbai_onehot.columns[-1]] + list(mumbai_onehot.columns[:-1])
mumbai_onehot = mumbai_onehot[fixed_columns]

mumbai_onehot.head()

In [None]:
# Examining Shape of new dataframe
mumbai_onehot.shape

#### Next, let's group rows by neighborhood and by taking the mean of the frequency of occurrence of each category

In [None]:
mumbai_grouped = mumbai_onehot.groupby('Neighborhood').mean().reset_index()
mumbai_grouped

Let's create the new dataframe and display the top 10 venues for each neighborhood.

In [None]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_mumbai = pd.DataFrame(columns=columns)
neighborhoods_venues_mumbai['Neighborhood'] = mumbai_grouped['Neighborhood']

for ind in np.arange(mumbai_grouped.shape[0]):
    neighborhoods_venues_mumbai.iloc[ind, 1:] = return_most_common_venues(mumbai_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_mumbai.head()

### Analysing Neighborhoods in Chennai

In [None]:
# one hot encoding
chennai_onehot = pd.get_dummies(chennai_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
chennai_onehot['Neighborhood'] = chennai_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [chennai_onehot.columns[-1]] + list(chennai_onehot.columns[:-1])
chennai_onehot = chennai_onehot[fixed_columns]

chennai_onehot.head()

In [None]:
# Examining Shape of new dataframe
chennai_onehot.shape

#### Next, let's group rows by neighborhood and by taking the mean of the frequency of occurrence of each category

In [None]:
chennai_grouped = chennai_onehot.groupby('Neighborhood').mean().reset_index()
chennai_grouped

Let's create the new dataframe and display the top 10 venues for each neighborhood.

In [None]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_chennai = pd.DataFrame(columns=columns)
neighborhoods_venues_chennai['Neighborhood'] = chennai_grouped['Neighborhood']

for ind in np.arange(chennai_grouped.shape[0]):
    neighborhoods_venues_chennai.iloc[ind, 1:] = return_most_common_venues(chennai_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_chennai.head()

### Analysing Neighborhoods in Kolkata

In [None]:
# one hot encoding
kolkata_onehot = pd.get_dummies(kolkata_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
kolkata_onehot['Neighborhood'] = kolkata_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [kolkata_onehot.columns[-1]] + list(kolkata_onehot.columns[:-1])
kolkata_onehot = kolkata_onehot[fixed_columns]

kolkata_onehot.head()

In [None]:
# Examining Shape of new dataframe
kolkata_onehot.shape

#### Next, let's group rows by neighborhood and by taking the mean of the frequency of occurrence of each category

In [None]:
kolkata_grouped = kolkata_onehot.groupby('Neighborhood').mean().reset_index()
kolkata_grouped

Let's create the new dataframe and display the top 10 venues for each neighborhood.

In [None]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_kolkata = pd.DataFrame(columns=columns)
neighborhoods_venues_kolkata['Neighborhood'] = kolkata_grouped['Neighborhood']

for ind in np.arange(kolkata_grouped.shape[0]):
    neighborhoods_venues_kolkata.iloc[ind, 1:] = return_most_common_venues(kolkata_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_kolkata.head()

## Cluster Neighborhoods

### In Delhi

In [None]:
# Running k-means to cluster
# set number of clusters
kclusters = 5

delhi_grouped_clustering = delhi_grouped.drop('Neighborhood', 1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(delhi_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

In [None]:
# add clustering labels
neighborhoods_venues_delhi.insert(0, 'Cluster Labels', kmeans.labels_)

delhi_merged = df_delhi

# merge toronto_grouped with toronto_data to add latitude/longitude for each neighborhood
delhi_merged = df_delhi.join(neighborhoods_venues_delhi.set_index('Neighborhood'), on='Neighborhood')

delhi_merged.head() # check the last columns!

Visualizing the cluster results

In [None]:
# create map
delhi_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(delhi_merged['latitude'], delhi_merged['longitude'], delhi_merged['Neighborhood'], delhi_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(delhi_clusters)
       
delhi_clusters

### In Mumbai

In [None]:
# Running k-means to cluster
# set number of clusters
kclusters = 5

mumbai_grouped_clustering = mumbai_grouped.drop('Neighborhood', 1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(mumbai_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

In [None]:
# add clustering labels
neighborhoods_venues_mumbai.insert(0, 'Cluster Labels', kmeans.labels_)

mumbai_merged = df_dmumbai

# merge toronto_grouped with toronto_data to add latitude/longitude for each neighborhood
mumbai_merged = mumbai.join(neighborhoods_venues_mumbai.set_index('Neighborhood'), on='Neighborhood')

mumbai_merged.head() # check the last columns!

Visualizing the cluster results

In [None]:
# create map
mumbai_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(mumbai_merged['latitude'], mumbai_merged['longitude'], mumbai_merged['Neighborhood'], mumbai_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(mumbai)
       
mumbai

### In Chennai

In [None]:
# Running k-means to cluster
# set number of clusters
kclusters = 5

chennai_grouped_clustering = chennai_grouped.drop('Neighborhood', 1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(chennai_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

In [None]:
# add clustering labels
neighborhoods_venues_chennai.insert(0, 'Cluster Labels', kmeans.labels_)

chennai_merged = df_chennai

# merge toronto_grouped with toronto_data to add latitude/longitude for each neighborhood
chennai_merged = df_chennai.join(neighborhoods_venues_chennai.set_index('Neighborhood'), on='Neighborhood')

chennai_merged.head() # check the last columns!

Visualizing the cluster results

In [None]:
# create map
chennai_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(chennai_merged['latitude'], chennai_merged['longitude'], chennai_merged['Neighborhood'], chennai_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(chennai_clusters)
       
chennai_clusters

### In Kolkata

In [None]:
# Running k-means to cluster
# set number of clusters
kclusters = 5

kolkata_grouped_clustering = kolkata_grouped.drop('Neighborhood', 1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(kolkata_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

In [None]:
# add clustering labels
neighborhoods_venues_kolkata.insert(0, 'Cluster Labels', kmeans.labels_)

kolkata_merged = df_kolkata

# merge toronto_grouped with toronto_data to add latitude/longitude for each neighborhood
kolkata_merged = df_kolkata.join(neighborhoods_venues_kolkata.set_index('Neighborhood'), on='Neighborhood')

kolkata_merged.head() # check the last columns!

Visualizing the cluster results

In [None]:
# create map
kolkata_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(kolkata_merged['latitude'], kolkata_merged['longitude'], kolkata_merged['Neighborhood'], kolkata_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(kolkata_clusters)
       
kolkata_clusters