# Capstone Project: The Battle of Neighborhoods in Dubai

# Table of Contents:

Introduction

Download and Explore Dataset

Explore Neighborhoods in Dubai

Analyze Each Neighborhood

Cluster Neighborhoods

Conclusion

# 1. Introduction

### 1.1. Business Problem:

Expo 2020 Dubai is a World Expo that will be hosted by Dubai in the United Arab Emirates. 
A World Expo is a mega international event in terms of size, scale, duration and visitor numbers.
It is a festival and a platform where people from all over the world come together to connect with each other,
share ideas, learn and innovate. It is also a place to come and have fun.
Expo 2020 was originally scheduled for 20 October 2020 – 10 April 2021. Due to the COVID-19 pandemic, 
the new schedule is 1 October 2021 – 31 March 2022.
The staging of the world fair and the preparations leading up to it are expected to result in an injection of nearly $40 billion into the economy 
and in an increase of at least 25 million visitors from inside and outside the UAE.



### 1.2. Target Audience:

The Dubai dataset is used to help visitors, investors and job seekers find suitable places such as restaurants, hotels, apartments, gyms and so on.

# 2. Data Description 

Using web scraping, information about the neighborhoods in Dubai, as well as average apartment rents, was gathered in a dataframe, and machine learning algorithms were then applied to it. The geopy package (Nominatim geocoder) was used to retrieve the latitude and longitude of each community. After some modifications, this information was used as input for the Foursquare API to get the availability and details of venues in the respective neighborhoods.
This information is gathered through web scraping from this webpage: https://en.wikipedia.org/wiki/List_of_communities_in_Dubai


## 2.1 Python Libraries
 For this project the following libraries are used:

In [1]:
pip install folium

Collecting folium
  Downloading folium-0.12.1-py2.py3-none-any.whl (94 kB)
     |████████████████████████████████| 94 kB 5.8 MB/s  eta 0:00:01
Collecting branca>=0.3.0
  Downloading branca-0.4.2-py3-none-any.whl (24 kB)
Installing collected packages: branca, folium
Successfully installed branca-0.4.2 folium-0.12.1
Note: you may need to restart the kernel to use updated packages.


In [2]:
pip install geocoder

Collecting geocoder
  Downloading geocoder-1.38.1-py2.py3-none-any.whl (98 kB)
     |████████████████████████████████| 98 kB 11.2 MB/s eta 0:00:01
Collecting ratelim
  Downloading ratelim-0.1.6-py2.py3-none-any.whl (4.0 kB)
Installing collected packages: ratelim, geocoder
Successfully installed geocoder-1.38.1 ratelim-0.1.6
Note: you may need to restart the kernel to use updated packages.


In [3]:
pip install geopy

Note: you may need to restart the kernel to use updated packages.


In [4]:
import requests # library to handle requests
import pandas as pd # library for data analysis
import numpy as np # library for vectorized computation
import random # library for random number generation
import itertools

# libraries for plots
import seaborn as sns
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import matplotlib.colors as colors
import folium

# modules to convert an address into latitude and longitude values
from geopy.geocoders import Nominatim
import geocoder
from geopy.exc import GeocoderTimedOut
from geopy.exc import GeocoderNotFound

# libraries for displaying images
from IPython.display import Image 
from IPython.display import HTML 
    
# transforming a JSON file into a pandas dataframe (top-level import, pandas >= 1.0)
from pandas import json_normalize

# library for the k-means algorithm
from sklearn.cluster import KMeans

print('Libraries imported.')

Libraries imported.


## 2.2 Data preparation:

### Download and Explore Dataset:
Dubai is mainly divided into 9 sectors, which are in turn divided into 226 communities. <br> 
In order to segment the neighborhoods and explore them, we need a dataset that contains those boroughs and the neighborhoods that exist in each borough, as well as the latitude and longitude coordinates of each neighborhood.  <br>
The dataset is scraped from this webpage:  https://en.wikipedia.org/wiki/List_of_communities_in_Dubai

In [5]:
url = 'https://en.wikipedia.org/wiki/List_of_communities_in_Dubai'
df_dubai = pd.read_html(url, header=0)[0]
df_dubai.head()

Unnamed: 0,Community Number,Community (English),Community (Arabic),Area(km2),Population(2000),Population density(/km2),Unnamed: 6
0,126.0,Abu Hail,أبو هيل,1.27 km2,21414.0,"16,861.4/km2",
1,711.0,Al Awir First,العوير الأولى,,,,
2,721.0,Al Awir Second,العوير الثانية,,,,
3,283.0,Aleyas,العياص,162.4 km2,1706.0,162.4/km2,
4,333.0,Al Bada'a,البدع,0.82 km2,18816.0,22946/km2,


In [6]:
# Drop columns that are not needed for further analysis:
df_dubai.drop(['Community (Arabic)', 'Area(km2)', 'Population(2000)', 'Population density(/km2)', 'Unnamed: 6'], axis = 1, inplace = True)
# Rename columns:
df_dubai.rename(columns={'Community (English)' : 'Community'}, inplace = True )

In [7]:
# Get the latitude and longitude of each community, skipping those that cannot be geocoded:
address = df_dubai['Community'].apply(lambda x: x.split('-')[-1]+', Dubai').unique()
geolocater = Nominatim(user_agent= "dubai_explorer")
location = []
empty = []
def getcoords (add, retries = 3):
    try:
        coords= geolocater.geocode(add, timeout = 10)
        location.append([add, coords.latitude, coords.longitude])
        print("the coords are {}".format(location[-1]))
        
    except GeocoderTimedOut:
        # retry a limited number of times instead of recursing without bound
        if retries > 0:
            return getcoords(add, retries - 1)
        empty.append([add])
        print("Couldn't find coords of{}".format(empty[-1]))
    
    except Exception:
        # geocode() returns None for an unknown address, so coords.latitude raises AttributeError
        empty.append([add])
        print("Couldn't find coords of{}".format(empty[-1]))
        
for add in address:
    getcoords(add)


the coords are ['Abu Hail, Dubai', 25.28553635, 55.32988062793524]
the coords are ['Al Awir First, Dubai', 25.185184200000002, 55.5651697615552]
Couldn't find coords of['Al Awir Second, Dubai']
the coords are ['Aleyas, Dubai', 25.2117884, 55.536023378308464]
the coords are ["Al Bada'a, Dubai", 25.22450955, 55.26864195698753]
the coords are ['Al Baraha, Dubai', 25.2810618, 55.3194665]
the coords are ['Al Barsha First, Dubai', 25.100320500000002, 55.18206715002786]
Couldn't find coords of['Al Barsha Second, Dubai']
Couldn't find coords of['Al Barsha Third, Dubai']
the coords are ['Al Barsha South First, Dubai', 25.100320500000002, 55.18206715002786]
Couldn't find coords of['Al Barsha South Second, Dubai']
Couldn't find coords of['Al Barsha South Third, Dubai']
Couldn't find coords of['Al Barsha South Fourth, Dubai']
Couldn't find coords of['Al Barsha South Fifth, Dubai']
the coords are ['Al Buteen, Dubai', 25.26305655, 55.3205840389995]
the coords are ['Al Corniche, Dubai', 25.2838169, 5
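The same retry idea can be written without recursion or module-level lists. The sketch below is a minimal illustration under stated assumptions: `geocode_with_retry`, `Point` and `fake_geocode` are hypothetical names introduced here, and the stub stands in for a real geocoder so that the example runs offline. With geopy, one would catch `GeocoderTimedOut` where this sketch catches `TimeoutError`.

```python
from collections import namedtuple

# Hypothetical stand-in for the object a geocoder returns (assumption for illustration).
Point = namedtuple("Point", ["latitude", "longitude"])


def geocode_with_retry(geocode, address, max_retries=3):
    """Call geocode(address), retrying on TimeoutError at most max_retries times.

    Returns (latitude, longitude), or None when the address is unknown
    or every attempt timed out.
    """
    for _ in range(max_retries + 1):
        try:
            result = geocode(address)
        except TimeoutError:
            continue  # try again until the retry budget is used up
        if result is None:
            return None  # the service does not know this address
        return (result.latitude, result.longitude)
    return None


def fake_geocode(address):
    """Offline stub: knows one address and returns None for everything else."""
    known = {"Abu Hail, Dubai": Point(25.2855, 55.3299)}
    return known.get(address)


found, missing = [], []
for addr in ["Abu Hail, Dubai", "Al Awir Second, Dubai"]:
    coords = geocode_with_retry(fake_geocode, addr)
    if coords is None:
        missing.append(addr)
    else:
        found.append([addr, *coords])
```

Returning a value instead of appending inside the function keeps the successful and the failed lookups easy to inspect separately.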

In [8]:
# to check how many boroughs we have after cleaning: 
len(location)

86

In [9]:
# transforming the data into a pandas dataframe:
dubai_data = pd.DataFrame(location, columns= ['Community', 'Latitude', 'Longitude'])
dubai_data

Unnamed: 0,Community,Latitude,Longitude
0,"Abu Hail, Dubai",25.285536,55.329881
1,"Al Awir First, Dubai",25.185184,55.565170
2,"Aleyas, Dubai",25.211788,55.536023
3,"Al Bada'a, Dubai",25.224510,55.268642
4,"Al Baraha, Dubai",25.281062,55.319466
...,...,...,...
81,"Margham, Dubai",24.899518,55.625454
82,"Dahal, Dubai",24.745043,55.361932
83,"Saih Al Salam, Dubai",24.953595,55.500715
84,"Al Lisaili, Dubai",24.930673,55.473254


In [10]:
address = 'DUBAI'

geolocator = Nominatim(user_agent="dubai_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Dubai are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Dubai are 25.2653471, 55.2924914.


### Create a map of Dubai with neighborhoods superimposed on top

In [11]:
# create map of Dubai using latitude and longitude values
map_dubai = folium.Map(location=[latitude, longitude], zoom_start=10)
# add markers to map
for lat, lng, Community in zip(dubai_data['Latitude'], dubai_data['Longitude'], dubai_data['Community']):
    # use the community name as the popup text
    label = '{}'.format(Community)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_dubai)  
    
map_dubai

In [12]:
# Use Foursquare to explore the area around the boroughs:
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID (placeholder; do not publish real credentials)
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret (placeholder)
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 500 # define radius

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: YOUR_FOURSQUARE_CLIENT_ID
CLIENT_SECRET:YOUR_FOURSQUARE_CLIENT_SECRET


In [13]:
neighborhood_latitude = dubai_data.loc[0, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = dubai_data.loc[0, 'Longitude'] # neighborhood longitude value

neighborhood_name = dubai_data.loc[0, 'Community'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of Abu Hail, Dubai are 25.28553635, 55.32988062793524.


In [14]:
# create the API request URL:
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)

# make the GET request:
results = requests.get(url).json()
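As an alternative to formatting the query string by hand, the parameters can be kept in a dictionary and encoded in one step. The sketch below uses only the standard library; the credential values are placeholders and `base_url` and `params` are names introduced for illustration.

```python
from urllib.parse import urlencode

# Placeholder values (assumptions for illustration, not real credentials).
params = {
    "client_id": "YOUR_CLIENT_ID",
    "client_secret": "YOUR_CLIENT_SECRET",
    "v": "20180605",
    "ll": "25.28553635,55.32988062793524",  # "latitude,longitude"
    "radius": 500,
    "limit": 100,
}

base_url = "https://api.foursquare.com/v2/venues/explore"
# urlencode escapes special characters, e.g. the comma in `ll` becomes %2C.
url = base_url + "?" + urlencode(params)
```

`requests.get(base_url, params=params)` performs the same encoding internally, so the dictionary can also be passed to `requests` directly.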

In [15]:
# function that extracts the category of the venue:
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        # the flattened dataframe uses the dotted column name
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [16]:
# Now we are ready to clean the json and structure it into a pandas dataframe:

venues = results['response']['groups'][0]['items']

nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues


Unnamed: 0,name,categories,lat,lng
0,Al Douri Mart Supermarket & Roastery,Supermarket,25.285869,55.328174
1,Pond Park - Al Qusais,Park,25.28806,55.332606
2,Baithak Restaurant,Asian Restaurant,25.288937,55.327372
3,Jannati Health Club and Spa,Spa,25.285408,55.325168
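To make the flattening step concrete, here is a small sketch on hand-written data that mimics the shape the cell above expects. The sample items are illustrative and are not an actual API response; `sample_items` and `flat` are names introduced here.

```python
import pandas as pd

# Hand-written sample in the expected shape (illustrative data only).
sample_items = [
    {"venue": {"name": "Sample Park",
               "categories": [{"name": "Park"}],
               "location": {"lat": 25.1, "lng": 55.2}}},
    {"venue": {"name": "Unlabelled Place",
               "categories": [],
               "location": {"lat": 25.3, "lng": 55.4}}},
]

# Nested dictionaries become dotted column names such as 'venue.location.lat';
# lists (here: the categories) are kept as list values.
flat = pd.json_normalize(sample_items)
flat = flat[["venue.name", "venue.categories",
             "venue.location.lat", "venue.location.lng"]].copy()

# Keep the first category name, or None when the list is empty.
flat["venue.categories"] = flat["venue.categories"].apply(
    lambda cats: cats[0]["name"] if cats else None
)

# Strip the 'venue.' and 'venue.location.' prefixes from the column names.
flat.columns = [col.split(".")[-1] for col in flat.columns]
```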


In [17]:
# to see how many venues were returned by Foursquare in Abu Hail:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

4 venues were returned by Foursquare.


# 3. Explore Neighborhoods in Dubai:

### Now we will retrieve the venues within a 500-meter radius of each neighborhood using the Foursquare API and merge them with the table above

In [18]:
# Define a function that takes the neighborhood names with their latitudes and longitudes and returns the venues around each location:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
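The list comprehension in `getNearbyVenues` indexes `categories[0]` directly, which would raise an `IndexError` if a venue came back with an empty category list (the case that `get_category_type` guards against earlier). A minimal sketch of a defensive lookup follows; `first_category_name` and the sample data are hypothetical and only illustrate the idea.

```python
def first_category_name(venue, default="Uncategorized"):
    """Return the venue's first category name, or `default` if it has none."""
    categories = venue.get("categories") or []
    return categories[0]["name"] if categories else default


# Illustrative items in the same nesting as the API results used above.
sample_results = [
    {"venue": {"name": "Sample Cafe", "categories": [{"name": "Café"}]}},
    {"venue": {"name": "No Category Venue", "categories": []}},
]

names = [first_category_name(item["venue"]) for item in sample_results]
```

Inside the function above, `v['venue']['categories'][0]['name']` could be replaced by a call like `first_category_name(v['venue'])`.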

In [19]:
# create a new dataframe called dubai_venues to run the above function:
dubai_venues = getNearbyVenues(names=dubai_data['Community'],
                                   latitudes=dubai_data['Latitude'],
                                   longitudes=dubai_data['Longitude']
                                  )

#checking the size of the resulting dataframe:
print(dubai_venues.shape)
dubai_venues.head()

Abu Hail, Dubai
Al Awir First, Dubai
Aleyas, Dubai
Al Bada'a, Dubai
Al Baraha, Dubai
Al Barsha First, Dubai
Al Barsha South First, Dubai
Al Buteen, Dubai
Al Corniche, Dubai
Al Dhagaya, Dubai
Al Faqa, Dubai
Al Garhoud, Dubai
Al Hamriya, Dubai, Dubai
Al Hamriya Port, Dubai
Al Hudaiba, Dubai
Al Jaddaf, Dubai
Al Jafiliya, Dubai
Al Karama, Dubai
Al Khabisi, Dubai
Al Khawaneej First, Dubai
Al Kifaf, Dubai
Al Mamzar, Dubai
Al Manara, Dubai
Al Mankhool, Dubai
Al Merkad, Dubai
Al Mina, Dubai
Al Mizhar First, Dubai
Al Muraqqabat, Dubai
Al Murar, Dubai
Al Mushrif, Dubai
Al Muteena, Dubai
Al Nahda First, Dubai
Al Nasr, Dubai, Dubai
Al Quoz First, Dubai
Al Quoz Industrial First, Dubai
Al Qusais First, Dubai
Al Raffa, Dubai
Al Ras, Dubai
Al Rashidiya, Dubai
Al Rigga, Dubai
Al Sabkha, Dubai
Al Safa First, Dubai
Al Satwa, Dubai
Al Shindagha, Dubai
Al Souq Al Kabeer, Dubai
Al Twar First, Dubai
Al Warqa'a Third, Dubai
Al Wasl, Dubai
Al Waheda, Dubai
Ayal Nasir, Dubai
Business Bay, Dubai
Bu Kadra, Dubai


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Abu Hail, Dubai",25.285536,55.329881,Al Douri Mart Supermarket & Roastery,25.285869,55.328174,Supermarket
1,"Abu Hail, Dubai",25.285536,55.329881,Pond Park - Al Qusais,25.28806,55.332606,Park
2,"Abu Hail, Dubai",25.285536,55.329881,Baithak Restaurant,25.288937,55.327372,Asian Restaurant
3,"Abu Hail, Dubai",25.285536,55.329881,Jannati Health Club and Spa,25.285408,55.325168,Spa
4,"Al Awir First, Dubai",25.185184,55.56517,al qbabh restaurant,25.183802,55.567921,Seafood Restaurant


### Let's check how many venues were returned for each neighborhood:

In [20]:
dubai_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
"Abu Hail, Dubai",4,4,4,4,4,4
"Al Awir First, Dubai",4,4,4,4,4,4
"Al Bada'a, Dubai",6,6,6,6,6,6
"Al Baraha, Dubai",10,10,10,10,10,10
"Al Barsha First, Dubai",32,32,32,32,32,32
...,...,...,...,...,...,...
"Umm Hurair First, Dubai",2,2,2,2,2,2
"Umm Ramool, Dubai",5,5,5,5,5,5
"Umm Suqeim First, Dubai",19,19,19,19,19,19
"Warsan First, Dubai",4,4,4,4,4,4


In [21]:
# how many unique categories can be curated from all the returned venues:
print('There are {} unique categories.'.format(len(dubai_venues['Venue Category'].unique())))

There are 205 unique categories.


# 4. Analyze Each Neighborhood

In [22]:
# one hot encoding
dubai_onehot = pd.get_dummies(dubai_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
dubai_onehot['Neighborhood'] = dubai_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [dubai_onehot.columns[-1]] + list(dubai_onehot.columns[:-1])
dubai_onehot = dubai_onehot[fixed_columns]

dubai_onehot.head()

Unnamed: 0,Neighborhood,Afghan Restaurant,African Restaurant,Airport Terminal,American Restaurant,Arcade,Art Gallery,Asian Restaurant,Athletics & Sports,Auto Garage,...,Tram Station,Tunnel,Turkish Restaurant,Vegetarian / Vegan Restaurant,Veterinarian,Video Store,Water Park,Wine Bar,Women's Store,Yemeni Restaurant
0,"Abu Hail, Dubai",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,"Abu Hail, Dubai",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,"Abu Hail, Dubai",0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
3,"Abu Hail, Dubai",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,"Al Awir First, Dubai",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [23]:
dubai_onehot.shape #dataframe size

(1399, 206)
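A toy example may help to see what the one-hot table contains and how it is aggregated later. The notebook groups it once with `sum()` (venue counts per neighborhood) and once with `mean()` (share of each category per neighborhood). The data below is made up for illustration.

```python
import pandas as pd

# Toy venue list (illustrative values only).
toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "B"],
    "Venue Category": ["Park", "Hotel", "Park"],
})

# One column per category, one row per venue, 1 where the venue has that category.
onehot = pd.get_dummies(toy[["Venue Category"]], prefix="", prefix_sep="", dtype=int)
onehot.insert(0, "Neighborhood", toy["Neighborhood"])

counts = onehot.groupby("Neighborhood").sum()   # number of venues per category
shares = onehot.groupby("Neighborhood").mean()  # fraction of venues per category
```

Neighborhood A has one park out of two venues, so its count for `Park` is 1 and its share is 0.5.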

#### Group rows by neighborhood by summing the occurrences of each category:

In [25]:
dubai_venue_grouped = dubai_onehot.groupby('Neighborhood').sum().reset_index()
dubai_venue_grouped

Unnamed: 0,Neighborhood,Afghan Restaurant,African Restaurant,Airport Terminal,American Restaurant,Arcade,Art Gallery,Asian Restaurant,Athletics & Sports,Auto Garage,...,Tram Station,Tunnel,Turkish Restaurant,Vegetarian / Vegan Restaurant,Veterinarian,Video Store,Water Park,Wine Bar,Women's Store,Yemeni Restaurant
0,"Abu Hail, Dubai",0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
1,"Al Awir First, Dubai",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,"Al Bada'a, Dubai",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,"Al Baraha, Dubai",0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,"Al Barsha First, Dubai",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
70,"Umm Hurair First, Dubai",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
71,"Umm Ramool, Dubai",0,0,0,0,0,0,0,0,4,...,0,0,0,0,0,0,0,0,0,0
72,"Umm Suqeim First, Dubai",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
73,"Warsan First, Dubai",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Looking at the dataframe, it is obvious that there is a column for every specific category and sub-category. However, we are interested in restaurants, no matter which type of restaurant they are. <br> The same is true for all of the other venues. <br>
That is why we need to modify this dataframe by summing up all of the sub-categories into the main categories: coffee shops, grocery stores, parks, gyms, hotels and restaurants. <br> 
The following steps are performed: <br>
1- Create a reference dataframe (dubai_grouped_sum) that has the same indexing as our dataframe dubai_venue_grouped. <br>
2- Create a separate dataframe for every category containing the sum of the specific category. It is important to keep the indexing from the original dataframe. <br>
3- Combine all of the dataframes based on their index. <br>
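The steps above can be sketched on toy data as follows. One caveat worth checking on the real column names: matching by substring also picks up unrelated categories that merely contain the word, for example a hypothetical `Parking` column when looking for `Park`. The column names and counts below are assumptions for illustration.

```python
import pandas as pd

# Toy grouped table (illustrative categories and counts).
toy = pd.DataFrame({
    "Neighborhood": ["A", "B"],
    "Park": [1, 0],
    "Parking": [0, 2],
    "Indian Restaurant": [3, 1],
    "Thai Restaurant": [1, 0],
})

# filter(like=...) keeps every column whose name contains the substring.
restaurants = toy.filter(like="Restaurant", axis=1).sum(axis=1).to_frame("Restaurants")
parks_loose = toy.filter(like="Park", axis=1).sum(axis=1).to_frame("Parks")

# Stricter alternative: keep only names that end with the word "Park".
parks_strict = toy.filter(regex=r"\bPark$", axis=1).sum(axis=1).to_frame("Parks")

# Combine on the shared index.
summary = pd.concat([toy[["Neighborhood"]], restaurants, parks_strict], axis=1)
```

In this toy table the loose match counts the two parking venues of neighborhood B as parks, while the stricter pattern does not.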


In [26]:
# Create a reference dataframe (dubai_grouped_sum) that has the same indexing as our dataframe dubai_venue_grouped:
dubai_grouped_sum = dubai_venue_grouped.iloc[:, :1]

# Create a separate dataframe for every category containing the sum of the specific category:
## gyms
gyms = dubai_venue_grouped.filter(like='Gym', axis=1).sum(axis=1).to_frame('Gyms')
## hotels
hotels = dubai_venue_grouped.filter(like='Hotel', axis=1).sum(axis=1).to_frame('Hotels')
## restaurants
restaurants = dubai_venue_grouped.filter(like='Restaurant', axis=1).sum(axis=1).to_frame('Restaurants')
## coffee shops
coffee_shops = dubai_venue_grouped.filter(like='Coffee', axis=1).sum(axis=1).to_frame('Coffee_Shops')
## parks
parks = dubai_venue_grouped.filter(like='Park', axis=1).sum(axis=1).to_frame('Parks')
## grocery stores
grocery = dubai_venue_grouped.filter(like='Grocery', axis=1).sum(axis=1).to_frame('Grocery Stores')

In [27]:
#Combine all of the dataframes based on their index
dubai_grouped_sum = pd.concat([dubai_grouped_sum, gyms, hotels, restaurants, coffee_shops, parks, grocery], axis=1)
dubai_grouped_sum.head()

Unnamed: 0,Neighborhood,Gyms,Hotels,Restaurants,Coffee_Shops,Parks,Grocery Stores
0,"Abu Hail, Dubai",0,0,1,0,1,0
1,"Al Awir First, Dubai",0,0,2,1,0,0
2,"Al Bada'a, Dubai",0,0,1,0,2,0
3,"Al Baraha, Dubai",0,2,3,0,0,0
4,"Al Barsha First, Dubai",0,11,6,1,0,1


#### Now that we have the count for every category, we create a "Favourite Score" by computing the normalized sum of favourite venues for every neighborhood

In [28]:
# build sum (numeric columns only, so the 'Neighborhood' strings are skipped)
dubai_grouped_sum['FavoriteScore'] = dubai_grouped_sum.sum(axis=1, numeric_only=True)

# normalize values
dubai_grouped_sum['FavoriteScore'] = dubai_grouped_sum['FavoriteScore']/dubai_grouped_sum["FavoriteScore"].sum()

# sort dataframe
dubai_grouped_sum.sort_values("FavoriteScore", ascending = False, inplace = True)
dubai_grouped_sum.reset_index(drop = True).head(10)

Unnamed: 0,Neighborhood,Gyms,Hotels,Restaurants,Coffee_Shops,Parks,Grocery Stores,FavoriteScore
0,"Trade Centre 1, Dubai",6,9,30,11,0,1,0.075899
1,"Al Karama, Dubai",1,1,54,0,0,0,0.074567
2,"Trade Centre 2, Dubai",4,4,31,10,0,1,0.066578
3,"Za'abeel First, Dubai",0,2,38,1,2,0,0.057257
4,"Marsa Dubai, Dubai",2,5,26,1,0,0,0.045273
5,"Downtown Dubai, Dubai",1,7,17,4,2,1,0.04261
6,"Al Nasr, Dubai, Dubai",1,3,25,2,1,0,0.04261
7,"Rigga Al Buteen, Dubai",0,15,12,1,1,0,0.038615
8,"Al Souq Al Kabeer, Dubai",0,4,21,1,0,0,0.034621
9,"Al Buteen, Dubai",1,9,11,1,0,1,0.030626
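In other words, the score of neighborhood i is its number of counted venues n_i divided by the total over all neighborhoods, score_i = n_i / Σ_j n_j, so the scores sum to 1. A small sketch on made-up counts (not the values from the table above) shows the arithmetic:

```python
import pandas as pd

# Toy counts (illustrative values only).
toy = pd.DataFrame({
    "Neighborhood": ["A", "B", "C"],
    "Hotels": [9, 1, 0],
    "Restaurants": [30, 54, 4],
    "Parks": [1, 1, 0],
})

totals = toy.sum(axis=1, numeric_only=True)   # venues per neighborhood: 40, 56, 4
toy["FavoriteScore"] = totals / totals.sum()  # share of all 100 counted venues

ranked = toy.sort_values("FavoriteScore", ascending=False).reset_index(drop=True)
```

Because the score only counts venues, a neighborhood with many restaurants and nothing else can rank above one with a more balanced mix, which is worth keeping in mind when reading the ranking.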


### Based on the list above, these ten neighborhoods have the highest "Favourite Score"

#### Print each neighborhood along with its top 5 most common venues

In [29]:
num_top_venues = 5

for hood in dubai_venue_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = dubai_venue_grouped[dubai_venue_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Abu Hail, Dubai----
               venue  freq
0        Supermarket   1.0
1   Asian Restaurant   1.0
2                Spa   1.0
3               Park   1.0
4  Afghan Restaurant   0.0


----Al Awir First, Dubai----
                  venue  freq
0     Convenience Store   1.0
1  Fast Food Restaurant   1.0
2    Seafood Restaurant   1.0
3           Coffee Shop   1.0
4     Afghan Restaurant   0.0


----Al Bada'a, Dubai----
                       venue  freq
0                       Park   2.0
1                       Café   1.0
2                       Pool   1.0
3                Tailor Shop   1.0
4  Middle Eastern Restaurant   1.0


----Al Baraha, Dubai----
                       venue  freq
0  Middle Eastern Restaurant   2.0
1                      Hotel   2.0
2                IT Services   1.0
3        American Restaurant   1.0
4                        Spa   1.0


----Al Barsha First, Dubai----
                       venue  freq
0                      Hotel  10.0
1                        P

### Let's put that into a pandas dataframe:

In [30]:
#to sort the venues in descending order:

def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
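When a neighborhood has fewer distinct categories than `num_top_venues`, the function above fills the remaining slots with categories whose count is zero (as with `Afghan Restaurant` in the printed output earlier). The following sketch shows one possible variant that leaves such slots empty; `top_nonzero_categories` is a hypothetical helper and the row values are made up.

```python
import pandas as pd


def top_nonzero_categories(row, n):
    """Return the n most frequent category names in `row`, padded with None.

    The first entry of `row` is assumed to be the neighborhood label.
    """
    counts = row.iloc[1:].astype(float)
    counts = counts[counts > 0]  # ignore categories that never occur
    ranked = counts.sort_values(ascending=False).index.tolist()[:n]
    return ranked + [None] * (n - len(ranked))


# Toy row (illustrative values, not taken from the Foursquare results).
row = pd.Series({"Neighborhood": "A", "Hotel": 10, "Pub": 3, "Park": 0, "Spa": 1})
top4 = top_nonzero_categories(row, 4)
```

Here only three categories occur, so the fourth slot stays `None` instead of naming a category with zero venues.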

#### Create the new dataframe and display the top 10 venues for each neighborhood:

In [31]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # positions after the third use the generic 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = dubai_venue_grouped['Neighborhood']

for ind in np.arange(dubai_venue_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(dubai_venue_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Abu Hail, Dubai",Supermarket,Asian Restaurant,Spa,Park,Afghan Restaurant,Performing Arts Venue,Nail Salon,Nightclub,North Indian Restaurant,Office
1,"Al Awir First, Dubai",Convenience Store,Fast Food Restaurant,Seafood Restaurant,Coffee Shop,Afghan Restaurant,North Indian Restaurant,Office,Organic Grocery,Outdoors & Recreation,Pakistani Restaurant
2,"Al Bada'a, Dubai",Park,Café,Pool,Tailor Shop,Middle Eastern Restaurant,Afghan Restaurant,Nail Salon,Nightclub,North Indian Restaurant,Office
3,"Al Baraha, Dubai",Middle Eastern Restaurant,Hotel,IT Services,American Restaurant,Spa,Lounge,Convenience Store,Café,Performing Arts Venue,North Indian Restaurant
4,"Al Barsha First, Dubai",Hotel,Pub,Middle Eastern Restaurant,Mexican Restaurant,Coffee Shop,Nightclub,Organic Grocery,Spa,Café,Buffet


## Cluster Neighborhoods

### Run k-means to cluster the neighborhoods into 5 clusters

In [32]:
dubai_grouped = dubai_onehot.groupby('Neighborhood').mean().reset_index()
dubai_grouped

Unnamed: 0,Neighborhood,Afghan Restaurant,African Restaurant,Airport Terminal,American Restaurant,Arcade,Art Gallery,Asian Restaurant,Athletics & Sports,Auto Garage,...,Tram Station,Tunnel,Turkish Restaurant,Vegetarian / Vegan Restaurant,Veterinarian,Video Store,Water Park,Wine Bar,Women's Store,Yemeni Restaurant
0,"Abu Hail, Dubai",0.0,0.000000,0.0,0.000000,0.0,0.0,0.250000,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
1,"Al Awir First, Dubai",0.0,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
2,"Al Bada'a, Dubai",0.0,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
3,"Al Baraha, Dubai",0.0,0.000000,0.0,0.100000,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
4,"Al Barsha First, Dubai",0.0,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
70,"Umm Hurair First, Dubai",0.0,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
71,"Umm Ramool, Dubai",0.0,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.0,0.8,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
72,"Umm Suqeim First, Dubai",0.0,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
73,"Warsan First, Dubai",0.0,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0


In [33]:
# set number of clusters
kclusters = 5

# keep only the numeric category frequencies for clustering
dubai_grouped_clustering = dubai_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(dubai_grouped_clustering)
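The number of clusters is fixed at 5 here. One common way to sanity-check such a choice is to compare the within-cluster sum of squares (`inertia_`) for several values of k. The sketch below does this on a tiny made-up matrix, not on the Dubai data, and `X` and `inertias` are names introduced for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy frequency vectors: rows 0-1 are restaurant-heavy, rows 2-3 are hotel-heavy.
X = np.array([
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],
    [0.2, 0.8],
])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = model.labels_  # one cluster id per row

# Inertia for a few candidate values of k; a sharp drop followed by a
# flat tail (the "elbow") hints at a reasonable number of clusters.
inertias = {
    k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    for k in (1, 2, 3)
}
```

On this toy matrix the two restaurant-heavy rows share a label and the two hotel-heavy rows share the other one. The numeric ids themselves are arbitrary.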


In [34]:
# add clustering labels
dubai_grouped.insert(0, 'Cluster labels', kmeans.labels_)

# rename columns as we will merge the dataframes based on the neighborhoods
dubai_data.rename(columns={'Community':'Neighborhood'}, inplace = True)

dubai_merged = dubai_data

# merge dubai_grouped with dubai_data to add latitude/longitude for each neighborhood
# (neighborhoods without any returned venues get NaN in the joined columns)
dubai_merged = dubai_merged.join(dubai_grouped.set_index('Neighborhood'), on='Neighborhood')

dubai_merged.head() # check the last columns!

Unnamed: 0,Neighborhood,Latitude,Longitude,Cluster labels,Afghan Restaurant,African Restaurant,Airport Terminal,American Restaurant,Arcade,Art Gallery,...,Tram Station,Tunnel,Turkish Restaurant,Vegetarian / Vegan Restaurant,Veterinarian,Video Store,Water Park,Wine Bar,Women's Store,Yemeni Restaurant
0,"Abu Hail, Dubai",25.285536,55.329881,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,"Al Awir First, Dubai",25.185184,55.56517,3.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Aleyas, Dubai",25.211788,55.536023,,,,,,,,...,,,,,,,,,,
3,"Al Bada'a, Dubai",25.22451,55.268642,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"Al Baraha, Dubai",25.281062,55.319466,3.0,0.0,0.0,0.0,0.1,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [35]:
dubai_merged.dropna(inplace = True)

In [36]:
dubai_merged.head()

Unnamed: 0,Neighborhood,Latitude,Longitude,Cluster labels,Afghan Restaurant,African Restaurant,Airport Terminal,American Restaurant,Arcade,Art Gallery,...,Tram Station,Tunnel,Turkish Restaurant,Vegetarian / Vegan Restaurant,Veterinarian,Video Store,Water Park,Wine Bar,Women's Store,Yemeni Restaurant
0,"Abu Hail, Dubai",25.285536,55.329881,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,"Al Awir First, Dubai",25.185184,55.56517,3.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,"Al Bada'a, Dubai",25.22451,55.268642,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"Al Baraha, Dubai",25.281062,55.319466,3.0,0.0,0.0,0.0,0.1,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,"Al Barsha First, Dubai",25.100321,55.182067,3.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
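The `NaN` row for Aleyas and the float-valued cluster labels both come from the join: a neighborhood that has coordinates but no venues has no match in the grouped table. A toy sketch of the same pattern (made-up data and names) is:

```python
import pandas as pd

# Toy inputs: 'B' has coordinates but no cluster label (illustrative values).
coords = pd.DataFrame({"Neighborhood": ["A", "B", "C"],
                       "Latitude": [25.1, 25.2, 25.3]})
clusters = pd.DataFrame({"Neighborhood": ["A", "C"],
                         "Cluster labels": [1, 3]})

# Left join on the neighborhood name, as in the cell above.
merged = coords.join(clusters.set_index("Neighborhood"), on="Neighborhood")

# 'B' has no match, so its label is NaN and the column is upcast to float.
n_missing = int(merged["Cluster labels"].isna().sum())

# Drop unmatched rows, then restore integer labels.
merged = merged.dropna().copy()
merged["Cluster labels"] = merged["Cluster labels"].astype(int)
```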


### Finally, let's visualize the resulting clusters

In [38]:
dubai_merged['Cluster labels']=dubai_merged['Cluster labels'].astype(int)

In [39]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(dubai_merged['Latitude'], dubai_merged['Longitude'], dubai_merged['Neighborhood'], dubai_merged['Cluster labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
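A note on the color lookup: k-means labels run from 0 to `kclusters - 1`, so `rainbow[cluster-1]` assigns the first palette entry to cluster 1 and, through Python's negative indexing, the last entry to cluster 0. Every cluster still gets its own color. The sketch below illustrates this with placeholder color names instead of the matplotlib palette.

```python
kclusters = 5

# Placeholder palette (names are illustrative, not real colors).
palette = ["color_{}".format(i) for i in range(kclusters)]

# Indexing as used in the map cell above: cluster 0 wraps around to the last entry.
shifted = {cluster: palette[cluster - 1] for cluster in range(kclusters)}

# Direct indexing: cluster i gets palette entry i.
direct = {cluster: palette[cluster] for cluster in range(kclusters)}
```

Either mapping works for the map. What matters is that the text describing a cluster's color uses the same mapping as the code.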

# 5. Results and Discussion:

A first important observation about how the districts appear on the Dubai map is that the chosen parameters produced a satisfactory segmentation. <br>
The color distribution on the map follows a clear logic: districts that share a color offer a similar mix of the services we are looking for.<br>
For example, the purple dots (cluster 1) suggest an extended center with many restaurants, hotels, parks, and cafes.

# 6. Conclusion:
This concludes the data analysis and the discussion of the results.<br>
In this project I have tried to help first-time travelers to Dubai, especially those who want to attend the big festival (Expo 2020). <br> 
I used common libraries such as geopy and folium to find the locations and plot them on a map, respectively. I also used the Foursquare API to explore the venues of each neighborhood.<br>
It is interesting to compare the boroughs of Dubai as places to live in or to visit. The venue types used to support this decision are hotels, cafes, parks, restaurants, grocery stores and gyms. <br> 
Other factors could be added, such as transportation, rents and population per borough.