# <center>Analysis on the Transportation Infrastructure Extension in Dubai</center>

**Author:** Cesar Hanna

**Date:** March 3, 2021

## **Table of Contents**

1. [Introduction](#Introduction)
2. [The Challenge](#The-Challenge)
3. [Description of the Solution](#Description-of-the-Solution)
4. [Description of Data](#Description-of-Data)
5. [Data References](#Data-References)
6. [Decision Criterion](#Decision-Criterion)
7. [Data Frame](#Data-Frame)

## **Introduction**

Dubai is the most populous city in the United Arab Emirates (UAE) and the capital of the Emirate of Dubai.

Located in the eastern part of the Arabian Peninsula on the coast of the Persian Gulf, Dubai aims to be the business hub of Western Asia. It is also a major global transport hub for passengers and cargo. Oil revenue helped accelerate the development of the city, which was already a major mercantile hub. Dubai's oil output made up 2.1 percent of the emirate's economy in 2008. A centre for regional and international trade since the early 20th century, Dubai has an economy that relies on revenues from trade, tourism, aviation, real estate, and financial services. According to government data, the population of Dubai was estimated at around 3,400,800 as of 8 September 2020.

## **The Challenge**

With a fast-growing economy and an influx of expatriates working in all sectors, Dubai has become densely populated. However, commuting to work, to school, or to other venues currently depends heavily on cars or taxis. There is a metro that connects a few areas, but it does not cover enough of the city for people to rely on it; the same goes for the bus system, whose coverage is also limited.

One of the major worldwide events, Expo 2020, is taking place in Dubai. It is a hot topic because of the increasing pressure it will place on Dubai’s transport systems and existing infrastructure. The aftermath of the event is also under discussion, in terms of how the construction work undertaken to accommodate it will affect Dubai in the ensuing years.

This creates several problems, to name a few:

- Financial: Commuting by taxi can be expensive. A lot of families cannot afford that.

- Infrastructure: Congestion hinders mobility, which in turn can affect businesses and ultimately the economy.

- Environmental: The United Arab Emirates is a contributor to greenhouse gas emissions, listed as having the 29th highest carbon dioxide emissions. Since the oil industry boom in the early 21st century, the population and its consumption of energy have increased sharply.

## **Description of the Solution**

One effective way to address these problems is to improve the transportation system by extending the metro and building a few additional underground stations.

For the business decision on where to build those stations to be effective, I have considered a few important features: population, number of venues (including schools and hospitals), and the distance to the nearest metro station. These features are explained in more detail in the next sections.

## **Description of Data**

The data frame contains the independent variables (features) that will be used to cluster the communities.
These features are:

- Community Number

- Community Name

- Population

- Number of venues

- Latitude

- Longitude

- Nearest walking distance MS from the centre of the community (km), where MS stands for metro station


## **Data References**

The data will be extracted from:

- Wikipedia: to get all the community information - https://en.wikipedia.org/wiki/List_of_communities_in_Dubai

- Google Maps: to get the metro stations walking distances - https://www.google.com/maps

- Google: to get coordinates - https://www.google.de/

- Other site(s) to get coordinates: https://www.distancesfrom.com/ae and https://vymaps.com/AE

- Foursquare API: to get the venues


## **Decision Criterion**

Based on the data at hand, and for the sake of this analysis only (a real decision on such a project would have to weigh many more variables, such as cost, environmental impact, and available area), I am going to use a small group of variables to decide where to extend the metro infrastructure and build new stations.

The decision follows a set of logical criteria, under the assumption that each new metro station is built in the centre of a community that meets them. The reason for this assumption is that these communities have a fairly small surface area, so a station in the centre should serve the whole community.

In real-life situations, the three features (population, number of venues, and walking distance to the nearest metro station) will not always all satisfy the criteria; therefore, meeting two of the three criteria is considered enough to decide.

Thresholds defined in numbers:

- Population: >= 1000

- Number of venues: a ratio of 15 venues for every 1000 people (15:1000); that means the number of venues should be >= 15, given the population threshold above

- Nearest walking distance MS from the centre of the community: >= 1.7 km
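
The two-out-of-three rule above can be expressed as a small check. The following is a minimal sketch rather than the notebook's actual implementation: the function names `criteria_met` and `is_candidate` and the sample values are assumptions made for illustration, and the venue threshold is read as the absolute value of 15 stated above.

```python
# Thresholds taken from the list above.
POPULATION_MIN = 1000    # people
VENUES_MIN = 15          # venues (15 per 1000 people at the minimum population)
DISTANCE_MIN_KM = 1.7    # walking distance to the nearest metro station


def criteria_met(population, venues, distance_km):
    """Count how many of the three criteria a community satisfies."""
    checks = [
        population >= POPULATION_MIN,
        venues >= VENUES_MIN,
        distance_km >= DISTANCE_MIN_KM,
    ]
    return sum(checks)


def is_candidate(population, venues, distance_km, required=2):
    """A community qualifies when at least `required` criteria hold."""
    return criteria_met(population, venues, distance_km) >= required


# Hypothetical communities, for illustration only.
print(is_candidate(population=5000, venues=4, distance_km=2.6))  # population and distance hold
print(is_candidate(population=800, venues=20, distance_km=0.5))  # only the venue criterion holds
```

Keeping the count (`criteria_met`) separate from the decision (`is_candidate`) makes it easy to tighten the rule to all three criteria later by passing `required=3`.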


## **Data Frame**

- **Step 1**: Importing and installing all the necessary libraries and packages

- **Step 2**: Scraping the first table from Wikipedia and putting it in a data frame.

- **Step 3**: Creating the second data frame with the coordinates and metro station distance features.

- **Step 4**: Merging the first and second data frames so we can prepare the resulting data frame to incorporate the venues later on.

- **Step 5**: Getting the data from Foursquare showing the venues of each community, and creating the third data frame.

- **Step 6**: Creating the “Number of Venues” column, and naming the new data frame "communities_count_venues_updated".

- **Step 7**: Merging all 3 data frames and dropping the unneeded columns.

<br>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**Step 1: Importing and installing all the necessary libraries and packages**

In [1]:
#Library to handle data in a vectorized manner
import numpy as np

#Library for data analysis
import pandas as pd

#Library to handle JSON files
import json

!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim #convert an address into latitude and longitude values

import requests #library to handle requests
from pandas import json_normalize #transform JSON file into a pandas dataframe (the pandas.io.json import path is deprecated since pandas 1.0)

#Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

#Map rendering library
!conda install -c conda-forge folium=0.5.0 --yes
import folium

print('Libraries imported.')

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

# All requested packages already installed.

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

# All requested packages already installed.

Libraries imported.


<br>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**Step 2: Scraping the first table from Wikipedia and putting it in a data frame.**

In [68]:
html_table = pd.read_html("https://en.wikipedia.org/wiki/List_of_communities_in_Dubai")
df_wiki = html_table[0] #choosing the first table in the wikipage
df_wiki

Unnamed: 0,Community Number,Community (English),Community (Arabic),Area(km2),Population(2000),Population density(/km2),Unnamed: 6
0,126.0,Abu Hail,أبو هيل,1.27 km²,21414,"16,861.4/km²",
1,711.0,Al Awir First,العوير الأولى,,,,
2,721.0,Al Awir Second,العوير الثانية,,,,
3,283.0,Aleyas,العياص,162.4 km2,1706,162.4/km2,
4,333.0,Al Bada'a,البدع,0.82 km²,18816,22946/km²,
...,...,...,...,...,...,...,...
141,914.0,Umm Nahad Fourth,,,,,
142,971.0,Saih Al-Dahal,,,,,
143,951.0,Saih Al Salam,,,,,
144,931.0,Al Lisaili,,,,,


&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Renaming "Population(2000)" column to "Population", "Community (English)" to "Community" and removing rows with NaN as population value

In [69]:
df_wiki_col_rename = df_wiki.rename(columns={'Population(2000)':'Population', 'Community (English)':'Community'})
df_wiki_nan = df_wiki_col_rename[df_wiki_col_rename['Population'].notna()]

#Dropping the rows whose population is the string '0'
df_wiki_updated = df_wiki_nan[df_wiki_nan.Population != '0']
df_wiki_updated

Unnamed: 0,Community Number,Community,Community (Arabic),Area(km2),Population,Population density(/km2),Unnamed: 6
0,126.0,Abu Hail,أبو هيل,1.27 km²,21414,"16,861.4/km²",
3,283.0,Aleyas,العياص,162.4 km2,1706,162.4/km2,
4,333.0,Al Bada'a,البدع,0.82 km²,18816,22946/km²,
5,122.0,Al Baraha,البراحة,1.104 km²,7823,"7,086/km²",
6,373.0,Al Barsha First,البرشاء الأولى,38.1 km²,1248,33/km²,
...,...,...,...,...,...,...,...
132,621.0,Warsan First,ورسان الاولى,17.1 km²,1421,83/km²,
133,622.0,Warsan Second,ورسان الثانية,17.1 km²,1421,83/km²,
134,861.0,Yaraah,يراح,83.8 km2,1222,12/km2,
135,325.0,Za'abeel First,زعبيل الأولى,10 km²,5283,528.3/km²,
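
The comparison with the string `'0'` above suggests that the scraped "Population" column holds text rather than numbers. If numeric comparisons are needed later (for example against the population threshold), one option is to convert the column first. This is a minimal sketch on a made-up three-row frame; the frame and its "Example Zero" row are illustrative, not scraped data.

```python
import pandas as pd

# Made-up rows mimicking the three cases handled above:
# a valid value, a missing value, and the string '0'.
raw = pd.DataFrame({
    "Community": ["Abu Hail", "Al Awir First", "Example Zero"],
    "Population": ["21414", None, "0"],
})

# Convert text to numbers; anything unparseable or missing becomes NaN.
cleaned = raw.assign(Population=pd.to_numeric(raw["Population"], errors="coerce"))

# Keep only rows with a positive population, then store it as an integer.
cleaned = cleaned[cleaned["Population"] > 0].astype({"Population": int})
print(cleaned)
```

The single filter `Population > 0` replaces both the `notna()` check and the `'0'` comparison, because comparisons against NaN evaluate to False.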


&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Removing columns "Unnamed: 6", "Community (Arabic)", "Area(km2)" and "Population density(/km2)" as these are not part of the data needed for clustering.

In [70]:
df_wiki_final = df_wiki_updated.drop(columns=['Unnamed: 6', 'Community (Arabic)', 'Area(km2)', 'Population density(/km2)'])
df_wiki_final

Unnamed: 0,Community Number,Community,Population
0,126.0,Abu Hail,21414
3,283.0,Aleyas,1706
4,333.0,Al Bada'a,18816
5,122.0,Al Baraha,7823
6,373.0,Al Barsha First,1248
...,...,...,...
132,621.0,Warsan First,1421
133,622.0,Warsan Second,1421
134,861.0,Yaraah,1222
135,325.0,Za'abeel First,5283


<br>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**Step 3: Creating the second data frame with the coordinates and metro station distance features.**

<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**<span style="color:green">This step was done manually: the coordinates of each community were looked up on Google, and the walking distance from the<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;centre of the community to the nearest metro station was measured with Google Maps, since no such data is available in bulk. The data is saved in a .csv file.</span>**
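
Manually measured walking distances can be sanity-checked against the straight-line (great-circle) distance, which a walking route should never undercut. The sketch below uses the haversine formula, which is not part of the original workflow. The function name `haversine_km` is an assumption, and the two points are the Abu Hail and Al Baraha community centres from the .csv file rather than metro station coordinates, which are not recorded in this notebook.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


# Abu Hail and Al Baraha community centres (coordinates from the .csv file).
d = haversine_km(25.2859, 55.3282, 25.2820, 55.3185)
print(round(d, 2), "km in a straight line")
```

A recorded walking distance that is shorter than the straight-line distance to the same station would point to a typo in either the coordinates or the distance.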

In [105]:
df_coord_ms = pd.read_csv(r"C:\Users\cesar\OneDrive\Documents\Cesar documents\IBM Data Science Certificate Program\Course 10 final capstone project\Dubai_Communities_Data_Features_Updated.csv")
df_coord_ms

Unnamed: 0,Community Number,Community Name (English),Latitude,Longitude,Nearest walking distance MS from the centre of the community (km)
0,126.0,Abu Hail,25.2859,55.3282,2.6
1,711.0,Al Awir First,,,
2,721.0,Al Awir Second,,,
3,283.0,Aleyas,25.2096,55.5495,18.0
4,333.0,Al Bada'a,25.2247,55.2687,2.4
...,...,...,...,...,...
141,914.0,Umm Nahad Fourth,,,
142,971.0,Saih Al-Dahal,,,
143,951.0,Saih Al Salam,,,
144,931.0,Al Lisaili,,,


&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Removing rows with NaN as latitude value and renaming "Community Name (English)" to "Community"

In [106]:
df_coord_ms_updated = df_coord_ms[df_coord_ms['Latitude'].notna()]
df_coord_ms_final = df_coord_ms_updated.rename(columns={'Community Name (English)':'Community'})
df_coord_ms_final

Unnamed: 0,Community Number,Community,Latitude,Longitude,Nearest walking distance MS from the centre of the community (km)
0,126.0,Abu Hail,25.285900,55.328200,2.60
3,283.0,Aleyas,25.209600,55.549500,18.00
4,333.0,Al Bada'a,25.224700,55.268700,2.40
5,122.0,Al Baraha,25.282000,55.318500,2.40
6,373.0,Al Barsha First,25.647000,55.811520,0.75
...,...,...,...,...,...
132,621.0,Warsan First,25.162687,55.422592,15.70
133,622.0,Warsan Second,25.164060,55.441080,17.00
134,861.0,Yaraah,24.454800,55.401400,117.00
135,325.0,Za'abeel First,25.223100,55.306100,3.50


<br>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**Step 4: Merging the first and second data frames so we can prepare the resulting data frame to incorporate the venues later on.**

In [107]:
df_wiki_coord_ms_merged = df_wiki_final[['Community Number','Community', 'Population']].merge(df_coord_ms_final[['Latitude', 'Longitude',
                                         'Nearest walking distance MS from the centre of the community (km)', 'Community Number', 
                                         'Community']], on='Community', how='left').drop(columns='Community Number_y')
df_wiki_coord_ms_merged_final = df_wiki_coord_ms_merged.rename(columns={'Community Number_x':'Community Number'})
df_wiki_coord_ms_merged_final

Unnamed: 0,Community Number,Community,Population,Latitude,Longitude,Nearest walking distance MS from the centre of the community (km)
0,126.0,Abu Hail,21414,25.285900,55.328200,2.60
1,283.0,Aleyas,1706,25.209600,55.549500,18.00
2,333.0,Al Bada'a,18816,25.224700,55.268700,2.40
3,122.0,Al Baraha,7823,25.282000,55.318500,2.40
4,373.0,Al Barsha First,1248,25.647000,55.811520,0.75
...,...,...,...,...,...,...
109,621.0,Warsan First,1421,25.162687,55.422592,15.70
110,622.0,Warsan Second,1421,25.164060,55.441080,17.00
111,861.0,Yaraah,1222,24.454800,55.401400,117.00
112,325.0,Za'abeel First,5283,25.223100,55.306100,3.50
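
A left merge on a name column silently leaves gaps when a community is spelled differently in the two sources. As a hedged illustration (not part of the original notebook), the `indicator` and `validate` options of pandas' `merge` can surface such rows. The small frames and the "Example Unmatched" row below are made up.

```python
import pandas as pd

# Made-up stand-ins for the Wikipedia frame (left) and the coordinates frame (right).
left = pd.DataFrame({
    "Community": ["Abu Hail", "Aleyas", "Example Unmatched"],
    "Population": [21414, 1706, 1000],
})
right = pd.DataFrame({
    "Community": ["Abu Hail", "Aleyas"],
    "Latitude": [25.2859, 25.2096],
    "Longitude": [55.3282, 55.5495],
})

merged = left.merge(
    right,
    on="Community",
    how="left",
    validate="one_to_one",  # raises if a community name is duplicated on either side
    indicator=True,         # adds a "_merge" column describing where each row came from
)

# Communities present in the left frame but missing from the right one.
unmatched = merged.loc[merged["_merge"] == "left_only", "Community"].tolist()
print(unmatched)
```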


<br>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**Step 5: Getting the data from Foursquare showing the venues of each community, and creating the third data frame.**

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Getting the coordinates of Dubai

In [104]:
address = 'Dubai, UAE'

geolocator = Nominatim(user_agent="dubai_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Dubai are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Dubai are 25.0750095, 55.18876088183319.


&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Creating the Dubai map showing all the communities as blue markers

In [109]:
#create map of Dubai using latitude and longitude values
map_dubai = folium.Map(location=[latitude, longitude], zoom_start=10)

#add markers to map
for lat, lng, community in zip(df_wiki_coord_ms_merged_final['Latitude'], df_wiki_coord_ms_merged_final['Longitude'], df_wiki_coord_ms_merged_final['Community']):
    label = '{}'.format(community)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_dubai)  
    
map_dubai

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Defining the Foursquare Credentials and Version

In [110]:
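# @hidden_cell
import os

#Credentials are read from environment variables instead of being hard-coded in the notebook.
#The variable names below are an assumption; adapt them to your own setup.
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID']
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET']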
VERSION = '20200605'
LIMIT = 100

print('Credentials are defined')

Credentials are defined


&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Function that explores all the communities in Dubai

In [170]:
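def getNearbyVenues(names, latitudes, longitudes, radius=500):
    """Return a data frame with one row per venue found within `radius` metres of each community centre."""
    
    venues_list=[]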
    for name, lat, lng in zip(names, latitudes, longitudes):
            
        #Create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        #Make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        #Return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Community', 
                  'Community Latitude', 
                  'Community Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
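
The list comprehension in the function above assumes that every response contains at least one group and that every venue has at least one category. A more defensive variant could look like the sketch below. The helper name `extract_venue_rows` and the sample payload are assumptions: the payload only mimics the fields that `getNearbyVenues` reads and is not real Foursquare output.

```python
def extract_venue_rows(payload, community, lat, lng):
    """Flatten one explore-style response into row tuples, tolerating missing parts."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []

    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # Use None when a venue carries no category instead of raising IndexError.
        category = categories[0]["name"] if categories else None
        rows.append((
            community, lat, lng,
            venue["name"],
            venue["location"]["lat"],
            venue["location"]["lng"],
            category,
        ))
    return rows


# Hypothetical payload with one categorised and one uncategorised venue.
sample_payload = {
    "response": {"groups": [{"items": [
        {"venue": {"name": "Venue A", "location": {"lat": 25.1, "lng": 55.1},
                   "categories": [{"name": "Spa"}]}},
        {"venue": {"name": "Venue B", "location": {"lat": 25.2, "lng": 55.2},
                   "categories": []}},
    ]}]}
}

rows = extract_venue_rows(sample_payload, "Example Community", 25.0, 55.0)
print(len(rows), "rows;", "categories:", [r[-1] for r in rows])
```

With this helper, a community for which the API returns no groups simply contributes no rows rather than interrupting the whole loop.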

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Using the above function to create the data frame showing the venues in Dubai

In [171]:
communities_venues = getNearbyVenues(names=df_wiki_coord_ms_merged_final['Community'], latitudes=df_wiki_coord_ms_merged_final['Latitude'], longitudes=df_wiki_coord_ms_merged_final['Longitude'])
communities_venues

Unnamed: 0,Community,Community Latitude,Community Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Abu Hail,25.2859,55.3282,Lively,25.285194,55.325276,Track
1,Abu Hail,25.2859,55.3282,Jannati Health Club and Spa,25.285408,55.325168,Spa
2,Abu Hail,25.2859,55.3282,Baithak Restaurant,25.288937,55.327372,Asian Restaurant
3,Abu Hail,25.2859,55.3282,Emirates Post - Abu Hail Post Office,25.286184,55.323577,Post Office
4,Al Bada'a,25.2247,55.2687,Al Boom Diving Club,25.227329,55.266449,Pool
...,...,...,...,...,...,...,...
1405,Za'abeel First,25.2231,55.3061,zabeel grocery,25.223506,55.309264,Grocery Store
1406,Za'abeel First,25.2231,55.3061,za'abeel center,25.223566,55.309284,Shopping Mall
1407,Za'abeel First,25.2231,55.3061,Divan,25.220000,55.306600,Coffee Shop
1408,Za'abeel First,25.2231,55.3061,Coco's Restaurant,25.219988,55.304141,Restaurant


<br>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**Step 6: Creating the “Number of Venues” column, and naming the new data frame "communities_count_venues_updated".**

In [198]:
communities_count_venues = communities_venues.groupby(['Community']).count().reset_index()

#After count(), every column holds the number of venues per community, so 'Venue Category' is simply renamed
communities_count_venues = communities_count_venues.rename(columns={'Venue Category':'Number of Venues'})

communities_count_venues_updated = communities_count_venues.drop(columns=['Community Latitude', 'Community Longitude', 'Venue',
                                                                 'Venue Latitude', 'Venue Longitude'])
communities_count_venues_updated

Unnamed: 0,Community,Number of Venues
0,Abu Hail,4
1,Al Bada'a,7
2,Al Baraha,10
3,Al Barsha Second,2
4,Al Barsha South Fifth,8
...,...,...
98,Umm Suqeim Second,7
99,Umm Suqeim Third,6
100,Warsan First,4
101,Za'abeel First,4
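
Because `count()` returns the number of non-null entries in every column, any of the columns could have been renamed to "Number of Venues". An arguably more direct alternative, shown here only as a sketch on made-up rows with placeholder venue names, is `groupby(...).size()`:

```python
import pandas as pd

# Made-up venue rows; the venue names are placeholders.
venues = pd.DataFrame({
    "Community": ["Abu Hail", "Abu Hail", "Al Bada'a"],
    "Venue": ["Venue A", "Venue B", "Venue C"],
})

# size() counts rows per group and yields a single column, so nothing has to be dropped.
venue_counts = (
    venues.groupby("Community")
    .size()
    .reset_index(name="Number of Venues")
)
print(venue_counts)
```

Communities for which no venues were returned do not appear in either version, which is why the left merge in Step 7 yields NaN for them.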


<br>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**Step 7: Merging all 3 data frames and dropping the unneeded columns.**

In [199]:
final_df_merged = df_wiki_coord_ms_merged_final[['Community Number','Community', 'Population', 'Latitude', 'Longitude', 
                                                'Nearest walking distance MS from the centre of the community (km)']].merge(
                                                communities_count_venues_updated[['Community', 'Number of Venues']], on='Community', how='left')
final_df_merged_updated = final_df_merged[final_df_merged['Number of Venues'].notna()].reset_index()
final_df_merged_updated

Unnamed: 0,index,Community Number,Community,Population,Latitude,Longitude,Nearest walking distance MS from the centre of the community (km),Number of Venues
0,0,126.0,Abu Hail,21414,25.285900,55.328200,2.6,4.0
1,2,333.0,Al Bada'a,18816,25.224700,55.268700,2.4,7.0
2,3,122.0,Al Baraha,7823,25.282000,55.318500,2.4,10.0
3,5,376.0,Al Barsha Second,1248,25.099700,55.212100,2.7,2.0
4,6,375.0,Al Barsha Third,1248,25.095300,55.195500,2.9,4.0
...,...,...,...,...,...,...,...,...
98,107,362.0,Umm Suqeim Second,16459,25.149900,55.206600,3.5,7.0
99,108,366.0,Umm Suqeim Third,16459,25.136700,55.195500,4.9,6.0
100,109,621.0,Warsan First,1421,25.162687,55.422592,15.7,4.0
101,112,325.0,Za'abeel First,5283,25.223100,55.306100,3.5,4.0


&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Converting the "Number of Venues" column to integer

In [200]:
final_df_merged_updated = final_df_merged_updated.astype({'Number of Venues':int})
final_df_merged_updated

Unnamed: 0,index,Community Number,Community,Population,Latitude,Longitude,Nearest walking distance MS from the centre of the community (km),Number of Venues
0,0,126.0,Abu Hail,21414,25.285900,55.328200,2.6,4
1,2,333.0,Al Bada'a,18816,25.224700,55.268700,2.4,7
2,3,122.0,Al Baraha,7823,25.282000,55.318500,2.4,10
3,5,376.0,Al Barsha Second,1248,25.099700,55.212100,2.7,2
4,6,375.0,Al Barsha Third,1248,25.095300,55.195500,2.9,4
...,...,...,...,...,...,...,...,...
98,107,362.0,Umm Suqeim Second,16459,25.149900,55.206600,3.5,7
99,108,366.0,Umm Suqeim Third,16459,25.136700,55.195500,4.9,6
100,109,621.0,Warsan First,1421,25.162687,55.422592,15.7,4
101,112,325.0,Za'abeel First,5283,25.223100,55.306100,3.5,4
