<H1 align='Center'> Battle Of Neighbourhood</h1>
<br><br>

This is my first post as part of my IBM Professional Data Science Certification. The project uses the Foursquare API together with other data sources such as Wikipedia. It allowed me to pull together many of the tools and libraries I learned in the course:

<br>
<ul>
<li> <font size=3>Jupyter notebook</font></li>
<li> <font size=3>pandas</font></li>
<li> <font size=3>folium</font></li>
<li> <font size=3>json</font></li>
<li> <font size=3>requests</font></li>
<li> <font size=3>bs4</font></li>
</ul>

<H3>Problem Statement</H3>
Indians are more open nowadays to drinking in public rather than in hideouts or at home, and drinking is becoming the norm at social gatherings. India has a population of 1.3 billion people, but still not enough bars to cater to such a large customer base.
Finding the right place to open a restaurant or bar is key to the business.
I am trying to build a model to identify the best locality in which to open the next bar in Mumbai (India).

<br><br>
<h3> Data set sources </h3>
<br> I collected locality data from <b><u>Wikipedia</u></b> and then used the <b><u>Foursquare API</u></b> to collect venue details.
<br><br>

In [100]:
import folium
import pandas as pd
from bs4 import BeautifulSoup
import requests as rst
import numpy as np
import json


<H4> Data collection to find the neighbourhoods in Mumbai</h4>

In [101]:
def get_the_neighbourhoods():
    url = r'https://en.wikipedia.org/wiki/List_of_neighbourhoods_in_Mumbai'
    wiki_data = rst.get(url).text
    data = BeautifulSoup(wiki_data, 'html.parser')
    tbl = data.find('table',{'class':'wikitable sortable'})
    scrapped_data = []
    for row in tbl.find_all('tr'):
        scrapped_row = []
        for col in row.find_all('td'):
            if col:
                area = col.text.rstrip()
                scrapped_row.append(area)
        if scrapped_row:
            scrapped_data.append(scrapped_row)
    columns = ['Area', 'Location', 'Lattitide', 'Longitude']
    df = pd.DataFrame(scrapped_data, columns=columns)
    return df

<h4> Get the number of bars in each neighbourhood</h4>

In [102]:
def get_number_of_bars(neighborhood_latitude, neighborhood_longitude):
    '''This function will take longitude and latitude as input and return number of bars in that neighbourhood'''
    try:
        CLIENT_ID= 'QSODRMNFQSB1U4COV520ELCDD1TE5ZDQZ5LM3A33GOZFY5GC'
        CLIENT_SECRET='3WKOZZ3TOWLZGOVQXKHNCQDWDAXXP2QC3Y5IAXCXI4RU3Q5S'
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&query=Bar'
        VERSION = '20191121'
        url_new = url.format(
        CLIENT_ID, 
        CLIENT_SECRET, 
        VERSION, 
        neighborhood_latitude, 
        neighborhood_longitude)
        response = rst.get(url_new).json()["response"]
        total_results = response['totalResults']
    except Exception as e:
        total_results = 0
    
    return total_results
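
The query string above is assembled by hand, with the credentials written into the notebook. As a minimal sketch, the same URL could be built with the standard library's `urlencode` while reading the keys from environment variables. The helper name `build_explore_url` and the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` are my own assumptions, not part of the Foursquare API.

```python
import os
from urllib.parse import urlencode

EXPLORE_ENDPOINT = 'https://api.foursquare.com/v2/venues/explore'

def build_explore_url(lat, lng, query='Bar', version='20191121'):
    '''Return the explore URL for one coordinate pair (no request is sent).'''
    params = {
        # hypothetical environment variable names; set them outside the notebook
        'client_id': os.environ.get('FOURSQUARE_CLIENT_ID', ''),
        'client_secret': os.environ.get('FOURSQUARE_CLIENT_SECRET', ''),
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'query': query,
    }
    # urlencode escapes every value, so the comma in ll becomes %2C
    return EXPLORE_ENDPOINT + '?' + urlencode(params)

url = build_explore_url(19.47, 72.8)
print(url)
```

The resulting string can be passed to `rst.get(...)` exactly like `url_new` in the function above.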

In [103]:
#Parse the data and get the neighbourhoods
neighbourHoods = get_the_neighbourhoods()
scrapped_data = neighbourHoods[['Lattitide','Longitude']]
res =[]
for index, row in scrapped_data.iterrows():
    #for row in scrapped_data:
    total_results = get_number_of_bars(row['Lattitide'], row['Longitude'])
    #resp = json.loads(response.text)
    res.append(total_results)
    
neighbourHoods['total_pubs'] = res


<h4>Identify the right dataset</H4>
<br>Take the 20 neighbourhoods with the lowest number of bars and the 5 with the highest number of bars in Mumbai. I am trying to identify which of the 20 neighbourhoods I should choose for the new bar.

In [110]:
neighbourHoods = neighbourHoods.sort_values(by='total_pubs')
# 20 neighbourhoods with the fewest bars (candidates) and 5 with the most (reference)
least_pub_nh = neighbourHoods.head(20)
most_pub_nh = neighbourHoods.tail(5)
df_to_process = pd.concat([least_pub_nh,most_pub_nh])
df_to_process

Unnamed: 0,Area,Location,Lattitide,Longitude,total_pubs
82,Hindu colony,"Dadar,South Mumbai",19.020841,19.020841,0
41,Nehru Nagar,"Kurla,Eastern Suburbs",15.451686,74.971977,3
33,Virar,Western Suburbs,19.47,72.8,11
32,Nalasopara,"Vasai,Western Suburbs",19.4154,72.8613,12
9,Mira Road,"Mira-Bhayandar,Western Suburbs",19.284167,72.871111,14
17,Dahisa,Western Suburbs,19.250069,72.859347,17
15,I.C. Colony,"Borivali (West),Western Suburbs",19.247039,72.84983,17
31,Naigaon,"Vasai,Western Suburbs",19.351467,72.846343,18
92,Thane,Mumbai,19.2,72.97,22
10,Bhayandar,"Mira-Bhayandar,Western Suburbs",19.29,72.85,28
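
Two rows in this table stand out: Hindu colony has the same value for latitude and longitude, and Nehru Nagar's coordinates lie far from the rest. Bar counts for such rows probably describe some other place, so it may be worth flagging them before ranking. A minimal sketch follows, using four rows copied from the table. The bounding box is a rough assumption of mine about the Mumbai region, not a value from the source data, and `flag_suspect_coordinates` is a hypothetical helper.

```python
import pandas as pd

# rough bounding box for the Mumbai region (assumed values, adjust as needed)
LAT_RANGE = (18.8, 19.6)
LNG_RANGE = (72.7, 73.1)

def flag_suspect_coordinates(df):
    '''Return a boolean Series that is True where a coordinate falls outside the box.'''
    # scraped cells are strings, so convert first; unparseable cells become NaN
    lat = pd.to_numeric(df['Lattitide'], errors='coerce')
    lng = pd.to_numeric(df['Longitude'], errors='coerce')
    inside = lat.between(*LAT_RANGE) & lng.between(*LNG_RANGE)
    return ~inside

sample = pd.DataFrame({
    'Area': ['Hindu colony', 'Nehru Nagar', 'Virar', 'Thane'],
    'Lattitide': ['19.020841', '15.451686', '19.47', '19.2'],
    'Longitude': ['19.020841', '74.971977', '72.8', '72.97'],
})
sample['suspect'] = flag_suspect_coordinates(sample)
print(sample[sample['suspect']]['Area'].tolist())
```

Flagged rows could then be corrected by hand or dropped before the venue lookup.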


<h4>Get the nearby popular venues in each neighbourhood using the Foursquare API</h4>

In [111]:
def getNearbyVenues(names, latitudes, longitudes, radius=10000, LIMIT=120):
    """Return specific number of Nearby venues within a given radius"""
    CLIENT_ID= 'QSODRMNFQSB1U4COV520ELCDD1TE5ZDQZ5LM3A33GOZFY5GC'
    CLIENT_SECRET='3WKOZZ3TOWLZGOVQXKHNCQDWDAXXP2QC3Y5IAXCXI4RU3Q5S'    
    VERSION = '20191121'
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = rst.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'NeighborhoodLatitude', 
                  'NeighborhoodLongitude', 
                  'Venue', 
                  'VenueLatitude', 
                  'VenueLongitude', 
                  'VenueCategory']
    
    return nearby_venues

nearbyvenues = getNearbyVenues(df_to_process['Area'],df_to_process['Lattitide'],df_to_process['Longitude'])


Hindu colony
Nehru Nagar
Virar
Nalasopara
Mira Road
Dahisa
I.C. Colony
Naigaon
Thane
Bhayandar
Mahavir Nagar
Poisar
Nahur
Charkop
Breach Candy
Kemps Corner
Cumbala Hill
Thakur village
Gowalia Tank
Sunder Nagar
Jogeshwari West
Bandstand Promenade
Kalina
Aarey Milk Colony
Mahul
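
`getNearbyVenues` indexes straight into the JSON, so a venue with an empty `categories` list, or a response without `groups`, would raise an exception and stop the whole loop. A defensive variant of the extraction step might look like the sketch below. It runs on a hand-made dictionary that only mimics the nesting used in the code above (`response`, `groups`, `items`, `venue`). The helper name `extract_venue_rows` and the `'Unknown'` fallback are my own choices.

```python
def extract_venue_rows(name, lat, lng, payload):
    '''Turn one explore-style payload into a list of row tuples.'''
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item.get('venue', {})
        location = venue.get('location', {})
        categories = venue.get('categories', [])
        # fall back to a placeholder when a venue has no category
        category = categories[0].get('name', 'Unknown') if categories else 'Unknown'
        rows.append((name, lat, lng, venue.get('name'),
                     location.get('lat'), location.get('lng'), category))
    return rows

# hand-made payload, not real API output
fake_payload = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Venue A', 'location': {'lat': 19.0, 'lng': 72.8},
               'categories': [{'name': 'Bar'}]}},
    {'venue': {'name': 'Venue B', 'location': {'lat': 19.1, 'lng': 72.9},
               'categories': []}},
]}]}}

rows = extract_venue_rows('Sample Area', 19.05, 72.85, fake_payload)
empty = extract_venue_rows('Sample Area', 19.05, 72.85, {})
print(len(rows), len(empty))
```

Each tuple has the same seven fields, in the same order, as the columns assigned to `nearby_venues`.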


In [112]:
def prepare_data_for_clustering(df_to_process):
    '''Build one row per neighbourhood with one column per venue category'''
    temp_copy = df_to_process.copy()
    # keep only the categories that occur more than twice across all neighbourhoods
    df = nearbyvenues.groupby('VenueCategory').size().reset_index(name='counts')
    df = df[df['counts']>2]
    no_of_areas_to_process= df_to_process['Area'].size
    for cat in df['VenueCategory']:
        ll = []
        d = nearbyvenues[nearbyvenues['VenueCategory']==cat]
        if not d.empty:
            for x in df_to_process['Area']:
                f = d[d['Neighborhood']== x]
                # f.size counts cells (rows x 7 columns), so each value is 7 times the venue count
                ll.append(f.size)
        else:
            # no venue of this category anywhere: fill the column with zeros
            ll = [0] * no_of_areas_to_process
        temp_copy[cat] = ll

    # drop the descriptive columns so that only the category counts remain
    temp_copy.drop(['total_pubs', 'Area', 'Location', 'Lattitide', 'Longitude'],axis=1, inplace = True)
    return temp_copy

temp_copy = prepare_data_for_clustering(df_to_process)
temp_copy 

Unnamed: 0,American Restaurant,Art Gallery,Asian Restaurant,Athletics & Sports,BBQ Joint,Bakery,Bar,Beach,Bengali Restaurant,Boat or Ferry,...,Tea Room,Thai Restaurant,Theater,Theme Park,Toy / Game Store,Track,Train Station,Vegetarian / Vegan Restaurant,Water Park,Women's Store
82,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
41,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,7,0,0,0
33,7,0,0,0,0,7,0,14,0,0,...,0,0,0,7,0,0,14,7,0,0
32,0,0,0,0,0,14,0,0,0,0,...,0,0,0,7,0,0,7,0,0,0
9,0,0,0,0,7,0,7,0,0,0,...,0,0,0,14,0,0,0,0,14,0
17,0,0,0,0,7,0,14,35,0,21,...,7,0,0,14,0,0,0,0,14,0
15,0,0,0,0,7,0,14,35,0,14,...,7,0,0,14,0,0,0,0,14,0
31,0,0,0,0,0,7,0,14,0,0,...,0,0,0,0,0,0,7,0,0,0
92,0,0,7,0,0,0,0,0,0,0,...,0,0,0,7,0,0,0,7,7,0
10,0,0,0,0,7,0,7,7,0,0,...,0,0,0,14,0,0,0,0,14,0
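
A neighbourhood-by-category matrix of this kind can also be produced in one call with `pd.crosstab`, which counts one per venue row, so the entries are plain venue counts. The sketch below uses an invented frame that reuses two column names of `nearbyvenues`; the venue rows are made up for illustration only.

```python
import pandas as pd

# invented rows that reuse the column names of nearbyvenues
venues = pd.DataFrame({
    'Neighborhood': ['Virar', 'Virar', 'Virar', 'Thane', 'Thane'],
    'VenueCategory': ['Bakery', 'Bakery', 'Train Station', 'Bakery', 'Water Park'],
})

# one row per neighbourhood, one column per category, cells hold venue counts
counts = pd.crosstab(venues['Neighborhood'], venues['VenueCategory'])

# keep categories that occur more than twice overall, as in the loop above
frequent = counts.loc[:, counts.sum(axis=0) > 2]
print(frequent)
```

Neighbourhoods for which no venue was returned do not appear in the crosstab, so they would have to be added back, for example with `reindex(..., fill_value=0)`.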



<H4> Use clustering to find similar neighbourhoods</h4>

In [155]:
# import k-means from scikit-learn
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=7, random_state=None, init='random', max_iter=300, n_init=10).fit(temp_copy)

# attach the cluster label generated for each row to the dataframe
df_to_process['ClusterLabels'] = kmeans.labels_
# clusters that contain the 5 neighbourhoods with the most bars (the last 5 rows)
clusters_of_most_popular_nh = df_to_process['ClusterLabels'].tail(5).unique().tolist()

# show every neighbourhood that shares a cluster with them
df_to_process[df_to_process['ClusterLabels'].isin(clusters_of_most_popular_nh)]

Unnamed: 0,Area,Location,Lattitide,Longitude,total_pubs,ClusterLabels
55,Breach Candy,South Mumbai,18.967,72.805,47,5
66,Kemps Corner,South Mumbai,18.9629,72.8054,47,5
61,Cumbala Hill,South Mumbai,18.965833,72.805833,48,5
89,Gowalia Tank,"Tardeo,South Mumbai",18.96245,72.809703,50,5
20,Jogeshwari West,Western Suburbs,19.12,72.85,111,0
12,Bandstand Promenade,"Bandra,Western Suburbs",19.042718,72.819132,113,5
30,Kalina,"Sanctacruz,Western Suburbs",19.081667,72.841389,115,0
18,Aarey Milk Colony,"Goregaon,Western Suburbs",19.148493,72.881756,134,0
51,Mahul,"Trombay,Harbour Suburbs",19.0,72.883333,178,5
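
Because the model is created with `random_state=None` and `init='random'`, the label numbers (and possibly the groupings) can change between runs, so a label such as 5 is only meaningful for this particular run. A minimal sketch of pinning the seed is shown below on toy data. The seed value 0 and the two-group array are arbitrary assumptions, used only to make the example reproducible.

```python
import numpy as np
from sklearn.cluster import KMeans

# toy feature matrix: two well separated groups of three points each
X = np.array([[0, 0], [0, 1], [1, 0],
              [10, 10], [10, 11], [11, 10]], dtype=float)

def fit_labels(seed):
    '''Fit k-means with a fixed seed and return the cluster labels.'''
    model = KMeans(n_clusters=2, random_state=seed, init='random',
                   max_iter=300, n_init=10)
    return model.fit(X).labels_

first = fit_labels(0)
second = fit_labels(0)
# identical seeds give identical labels, so results can be compared across runs
print((first == second).all())
```

With a fixed seed, the table above and the conclusion drawn from it could be regenerated exactly.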


<H3> Conclusion</h3>
The neighbourhoods with the most bars fall in clusters 5 and 0. <b><u>'Breach Candy'</u></b> and <b><u>'Kemps Corner'</u></b> belong to the same cluster (5) and have the fewest bars among the neighbourhoods listed above.<br>
Hence I conclude that these 2 localities are the best-suited neighbourhoods for our new bar.