# Capstone Project - The Battle of Neighbourhoods
## Part 1 [Week 1]
________________________________________________
Clearly define a problem or an idea of your choice, where you would need to leverage the Foursquare location data to solve or execute. Remember that data science problems always target an audience and are meant to help a group of stakeholders solve a problem, so make sure that you explicitly describe your audience and why they would care about your problem.

This submission will eventually become the **Introduction / Business Problem** section of your final report. I therefore recommend that you push the report (containing only the Introduction / Business Problem section for now) to your GitHub repository and submit a link to it.

1. A description of the problem and a discussion of the background. (15 marks)
2. A description of the data and how it will be used to solve the problem. (15 marks)

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)

## Introduction: Business Problem <a name="introduction"></a>

The aim of this project is to find the best location to open an eatery in central **Paris, France**. Specifically, the analysis in this report is intended for a board of directors interested in opening several eateries in a densely populated city such as Paris.

Past quantitative data from other eateries suggest that the best locations for an eatery are not necessarily only those near other restaurants. The data also suggest that popular areas are in fact close to fashion outlets and wine bars. This could be because Parisians are very sociable, which leads to a high frequency of visits among these locations.

The first task is to **analyse the district data**: all the district data for Paris, including names, location data where available, and the frequency distribution of venues in each district.

The second task is to **analyse** the districts with the highest frequency of **Restaurants, Fashion outlets and Wine Bars**, which will enable us to pinpoint which districts are best for opening eateries.
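The ranking step described above could be sketched as follows. This is a minimal illustration on a made-up venue table: the column names `Neighborhood` and `Category`, the sample rows and the exact category labels are assumptions for the example, not results from the project.

```python
import pandas as pd

# Hypothetical venue records: one row per venue, tagged with its district.
venues = pd.DataFrame({
    "Neighborhood": ["Temple", "Temple", "Temple", "Temple",
                     "Bourse", "Bourse", "Passy"],
    "Category": ["French Restaurant", "Wine Bar", "Clothing Store", "Hotel",
                 "Wine Bar", "Clothing Store", "French Restaurant"],
})

# Categories of interest (assumed labels; real Foursquare names may differ).
target_categories = ["French Restaurant", "Clothing Store", "Wine Bar"]

# Keep only the target categories, then count venues per district and category.
counts = (
    venues[venues["Category"].isin(target_categories)]
    .groupby(["Neighborhood", "Category"])
    .size()
    .unstack(fill_value=0)
)

# Rank districts by the total number of target venues.
counts["Total"] = counts.sum(axis=1)
ranking = counts.sort_values("Total", ascending=False)
print(ranking)
```

Counting per category before summing keeps the breakdown visible, so a district dominated by a single category can be told apart from one with a balanced mix.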

Throughout the project, data science tools will be used to process the data and explore the neighbourhoods, so that the most suitable neighbourhood for opening the eateries can be identified.

The analysis and recommendations for the new eateries focus only on the general districts containing these establishments, not on specific addresses. The results will narrow down the best district options for further research, since more factors must be considered before opening a storefront.


## Data <a name="data"></a>


In this section, I describe the data in light of the problem definition. The steps that will influence decisions are:
* Find suitable sources for the district data for Paris
* Conduct exploratory research on the dataset
* Clean the data and convert it into a usable dataset for data analysis


I will use the geographical coordinates of Paris, France to plot the neighbourhoods (arrondissements) within the city, and finally cluster the neighbourhoods to present the findings.

Arrondissements Municipaux for Paris CSV (administrative districts)
Paris is divided into 20 Arrondissements Municipaux (or administrative districts), shortened to just arrondissements. They are normally referenced by the arrondissement number rather than a name.

The following data sources will be needed to extract or generate the required information:

- [**Part 1**: Using a real world data set from Wikipedia containing all the Arrondissements in Paris](#part1): A dataset consisting of the arrondissements in Paris. Based on this dataset, we can clearly separate Paris into its respective arrondissements, laying the foundation for the research. This arrangement lets us understand the characteristics of each district later in the analysis, e.g. which district has the fewest restaurants or the highest number of cafés.

- [**Part 2**: Data from Foursquare API to explore the Arrondissements of Paris (Neighbourhoods)](#part2): A dataset consisting of the different venues within each arrondissement and their respective variables. The Foursquare data adds detail on top of the district data, providing what is needed for other analyses such as visit frequency, competition analysis and cost price analysis (not covered in this project).

In [35]:
# Import libraries
import numpy as np # library to handle data in a vectorized manner
import json # library to handle JSON files
import pandas as pd

!pip install geopy
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import matplotlib.colors as colors

from bs4 import BeautifulSoup

# Import k-means for the clustering stage
from sklearn.cluster import KMeans

!pip install folium
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


### **Part 1:** Using a real world data set from Wikipedia containing all the Arrondissements in Paris <a name="part1"></a>

Properties of the arrondissements dataset that are useful for the analysis:
* NAME = Neighbourhood
* CAR = Arrondissement_Num
* LAR = French_Name
* Geometry_X = Latitude
* Geometry_Y = Longitude

Data from Open|DATA France: https://opendata.paris.fr/explore/dataset/arrondissements/table/?dataChart





In [36]:
# Download the dataset and read it into a pandas dataframe.

# The Arrondissements dataset was downloaded from Paris|DATA:  https://opendata.paris.fr/explore/dataset/arrondissements/table/?dataChart
# A copy was then placed on the GitHub repo for the project and is read from there.

paris = pd.read_csv('https://raw.githubusercontent.com/ethansu1992/Coursera_Capstone/master/arrondissements.csv')
paris

| | CAR | NAME | NSQAR | CAR.1 | CARINSEE | LAR | NSQCO | SURFACE | PERIMETRE | Geometry_X | Geometry_Y |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | Temple | 750000003 | 3 | 3 | 3eme Ardt | 750001537 | 1170882828 | 4519264 | 48.862872 | 2.360001 |
| 1 | 19 | Buttes-Chaumont | 750000019 | 19 | 19 | 19eme Ardt | 750001537 | 6792651129 | 11253182 | 48.887076 | 2.384821 |
| 2 | 14 | Observatoire | 750000014 | 14 | 14 | 14eme Ardt | 750001537 | 5614877309 | 10317483 | 48.829245 | 2.326542 |
| 3 | 10 | Entrepot | 750000010 | 10 | 10 | 10eme Ardt | 750001537 | 2891739442 | 6739375 | 48.87613 | 2.360728 |
| 4 | 12 | Reuilly | 750000012 | 12 | 12 | 12eme Ardt | 750001537 | 16314782637 | 24089666 | 48.834974 | 2.421325 |
| 5 | 16 | Passy | 750000016 | 16 | 16 | 16eme Ardt | 750001537 | 16372542129 | 17416110 | 48.860392 | 2.261971 |
| 6 | 11 | Popincourt | 750000011 | 11 | 11 | 11eme Ardt | 750001537 | 3665441552 | 8282012 | 48.859059 | 2.380058 |
| 7 | 2 | Bourse | 750000002 | 2 | 2 | 2eme Ardt | 750001537 | 991153745 | 4554104 | 48.868279 | 2.342803 |
| 8 | 4 | Hotel-de-Ville | 750000004 | 4 | 4 | 4eme Ardt | 750001537 | 1600585632 | 5420908 | 48.854341 | 2.35763 |
| 9 | 17 | Batignolles-Monceau | 750000017 | 17 | 17 | 17eme Ardt | 750001537 | 5668834504 | 10775580 | 48.887327 | 2.306777 |


In [37]:
# Rename the columns needed for the analysis ('Geometry_X', 'Geometry_Y', etc.)

# Neighborhood : name of the central district of the Arrondissement
# Arrondissement_Num : the Arrondissement (district) number used to identify it
# French_Name : the descriptive French label for each Arrondissement

paris.rename(columns={'NAME': 'Neighborhood', 'CAR.1': 'Arrondissement_Num', 'Geometry_X': 'Latitude', 'Geometry_Y': 'Longitude', 'LAR': 'French_Name'}, inplace=True)
paris

| | CAR | Neighborhood | NSQAR | Arrondissement_Num | CARINSEE | French_Name | NSQCO | SURFACE | PERIMETRE | Latitude | Longitude |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | Temple | 750000003 | 3 | 3 | 3eme Ardt | 750001537 | 1170882828 | 4519264 | 48.862872 | 2.360001 |
| 1 | 19 | Buttes-Chaumont | 750000019 | 19 | 19 | 19eme Ardt | 750001537 | 6792651129 | 11253182 | 48.887076 | 2.384821 |
| 2 | 14 | Observatoire | 750000014 | 14 | 14 | 14eme Ardt | 750001537 | 5614877309 | 10317483 | 48.829245 | 2.326542 |
| 3 | 10 | Entrepot | 750000010 | 10 | 10 | 10eme Ardt | 750001537 | 2891739442 | 6739375 | 48.87613 | 2.360728 |
| 4 | 12 | Reuilly | 750000012 | 12 | 12 | 12eme Ardt | 750001537 | 16314782637 | 24089666 | 48.834974 | 2.421325 |
| 5 | 16 | Passy | 750000016 | 16 | 16 | 16eme Ardt | 750001537 | 16372542129 | 17416110 | 48.860392 | 2.261971 |
| 6 | 11 | Popincourt | 750000011 | 11 | 11 | 11eme Ardt | 750001537 | 3665441552 | 8282012 | 48.859059 | 2.380058 |
| 7 | 2 | Bourse | 750000002 | 2 | 2 | 2eme Ardt | 750001537 | 991153745 | 4554104 | 48.868279 | 2.342803 |
| 8 | 4 | Hotel-de-Ville | 750000004 | 4 | 4 | 4eme Ardt | 750001537 | 1600585632 | 5420908 | 48.854341 | 2.35763 |
| 9 | 17 | Batignolles-Monceau | 750000017 | 17 | 17 | 17eme Ardt | 750001537 | 5668834504 | 10775580 | 48.887327 | 2.306777 |


In [38]:
# Clean up the dataset to remove unnecessary columns.
# Some of the columns are for mapping software - not required here.

paris.drop(['NSQAR', 'CARINSEE', 'NSQCO', 'SURFACE', 'PERIMETRE', 'CAR'], axis=1, inplace=True)
paris


| | Neighborhood | Arrondissement_Num | French_Name | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | Temple | 3 | 3eme Ardt | 48.862872 | 2.360001 |
| 1 | Buttes-Chaumont | 19 | 19eme Ardt | 48.887076 | 2.384821 |
| 2 | Observatoire | 14 | 14eme Ardt | 48.829245 | 2.326542 |
| 3 | Entrepot | 10 | 10eme Ardt | 48.87613 | 2.360728 |
| 4 | Reuilly | 12 | 12eme Ardt | 48.834974 | 2.421325 |
| 5 | Passy | 16 | 16eme Ardt | 48.860392 | 2.261971 |
| 6 | Popincourt | 11 | 11eme Ardt | 48.859059 | 2.380058 |
| 7 | Bourse | 2 | 2eme Ardt | 48.868279 | 2.342803 |
| 8 | Hotel-de-Ville | 4 | 4eme Ardt | 48.854341 | 2.35763 |
| 9 | Batignolles-Monceau | 17 | 17eme Ardt | 48.887327 | 2.306777 |


In [39]:
# Check the shape of the dataframe
paris.shape

(20, 5)
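Beyond the shape, a few sanity checks can catch a bad download early. The sketch below is an illustration under stated assumptions: `check_districts` is a hypothetical helper, it is run on a tiny stand-in frame with the same column names as `paris`, and the bounding box for Paris is a rough approximation chosen for the example.

```python
import pandas as pd

def check_districts(df, expected_rows):
    """Return a list of problems found in a district table (empty if none)."""
    problems = []
    if len(df) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(df)}")
    if df["Arrondissement_Num"].duplicated().any():
        problems.append("duplicate arrondissement numbers")
    # Rough bounding box around Paris (approximate, for illustration only).
    lat_ok = df["Latitude"].between(48.80, 48.91)
    lng_ok = df["Longitude"].between(2.22, 2.48)
    if not (lat_ok & lng_ok).all():
        problems.append("coordinates outside the expected Paris bounding box")
    return problems

# Stand-in sample using three rows from the table above.
sample = pd.DataFrame({
    "Neighborhood": ["Temple", "Bourse", "Passy"],
    "Arrondissement_Num": [3, 2, 16],
    "Latitude": [48.862872, 48.868279, 48.860392],
    "Longitude": [2.360001, 2.342803, 2.261971],
})

print(check_districts(sample, expected_rows=3))
```

On the full dataset the equivalent call would be `check_districts(paris, expected_rows=20)`.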

In [40]:
# Retrieve the Latitude and Longitude for Paris
from geopy.geocoders import Nominatim 

address = 'Paris'

# Define the user_agent as Paris_explorer
geolocator = Nominatim(user_agent="Paris_explorer")

location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

print('The geographical coordinates of Paris France are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Paris France are 48.8566969, 2.3514616.


In [42]:
# create map of Paris using the above latitude and longitude values
map_paris = folium.Map(location=[latitude, longitude], zoom_start=12)


# add markers to map
for lat, lng, label in zip(paris['Latitude'], paris['Longitude'], paris['French_Name']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=25,
        popup=label,
        color='red',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.3,
        parse_html=False).add_to(map_paris)  
    
map_paris

### **Part 2:** Data from Foursquare API to explore the Arrondissements of Paris (Neighbourhoods) <a name="part2"></a>

Data Analysis and Location Data:

* Foursquare location data will be leveraged to explore or compare districts around Paris.

* Data manipulation and analysis to derive subsets of the initial data.

* Identifying the high traffic areas using data visualisation and statistical analysis.

In [43]:
CLIENT_ID = 'KOSN3H1ZVDEB3PWLAF3FWM0FF0EP33TXCUWXJHKX2ILFXIKZ' # your Foursquare ID
CLIENT_SECRET = 'QSUVFEHKECE2Y30SWKDAFMUN0B10S2EDJECNWI2SVZKNHTXJ' # your Foursquare Secret
VERSION = '20200517' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: KOSN3H1ZVDEB3PWLAF3FWM0FF0EP33TXCUWXJHKX2ILFXIKZ
CLIENT_SECRET:QSUVFEHKECE2Y30SWKDAFMUN0B10S2EDJECNWI2SVZKNHTXJ
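Hard-coding and printing API secrets makes them easy to leak when a notebook is shared. One common alternative, sketched here under assumptions, is to read them from environment variables and mask them when printing. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helper `mask` are hypothetical and not part of the original notebook.

```python
import os

def mask(secret, visible=4):
    """Show only the first few characters of a secret when printing."""
    if len(secret) <= visible:
        return "*" * len(secret)
    return secret[:visible] + "*" * (len(secret) - visible)

# Hypothetical variable names; set them in the shell before starting Jupyter.
CLIENT_ID = os.environ.get("FOURSQUARE_CLIENT_ID", "")
CLIENT_SECRET = os.environ.get("FOURSQUARE_CLIENT_SECRET", "")

if not CLIENT_ID or not CLIENT_SECRET:
    print("Foursquare credentials are not set in the environment.")
else:
    print("CLIENT_ID: " + mask(CLIENT_ID))
    print("CLIENT_SECRET: " + mask(CLIENT_SECRET))
```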


In [44]:
# Explore the first Neighborhood in our dataframe.
# Get the Neighborhood's French name.

paris.loc[0, 'French_Name']

'3eme Ardt'

In [45]:
# Get the Neighborhood's latitude and longitude values.

neighborhood_latitude = paris.loc[0, 'Latitude'] # Neighborhood latitude value
neighborhood_longitude = paris.loc[0, 'Longitude'] # Neighborhood longitude value

neighborhood_name = paris.loc[0, 'French_Name'] # Neighborhood name

print('Latitude and longitude values of the neighborhood {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of the neighborhood 3eme Ardt are 48.86287238, 2.3600009859999997.


In [46]:
LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 500 # define radius

url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)
url # displays the URL

'https://api.foursquare.com/v2/venues/explore?&client_id=KOSN3H1ZVDEB3PWLAF3FWM0FF0EP33TXCUWXJHKX2ILFXIKZ&client_secret=QSUVFEHKECE2Y30SWKDAFMUN0B10S2EDJECNWI2SVZKNHTXJ&v=20200517&ll=48.86287238,2.3600009859999997&radius=500&limit=100'
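As an alternative to formatting the query string by hand, the URL can be assembled from a dictionary so that each parameter is named and encoded consistently. This is a sketch only: `build_explore_url` is a hypothetical helper, and `MY_ID` and `MY_SECRET` are placeholders rather than real keys.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Assemble a venues/explore URL with encoded query parameters."""
    base = "https://api.foursquare.com/v2/venues/explore"
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lng}",
        "radius": radius,
        "limit": limit,
    }
    # safe="," keeps the comma in "ll" unescaped, as in the hand-built URL.
    return base + "?" + urlencode(params, safe=",")

# Placeholder credentials (not real keys).
example_url = build_explore_url("MY_ID", "MY_SECRET", "20200517",
                                48.86287238, 2.3600009859999997, 500, 100)
print(example_url)

# Parse the query string back to confirm the parameters survived encoding.
query = parse_qs(urlsplit(example_url).query)
print(query["ll"], query["radius"])
```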

In [47]:
# Send the GET request and examine the results

results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5ec1025469babe001b8cc523'},
 'response': {'groups': [{'items': [{'reasons': {'count': 0,
       'items': [{'reasonName': 'globalInteractionReason',
         'summary': 'This spot is popular',
         'type': 'general'}]},
      'referralId': 'e-0-4d974096a2c654814aa6d353-0',
      'venue': {'categories': [{'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/food/deli_',
          'suffix': '.png'},
         'id': '4bf58dd8d48988d1c5941735',
         'name': 'Sandwich Place',
         'pluralName': 'Sandwich Places',
         'primary': True,
         'shortName': 'Sandwiches'}],
       'id': '4d974096a2c654814aa6d353',
       'location': {'address': '57 rue de Bretagne',
        'cc': 'FR',
        'city': 'Paris',
        'country': 'France',
        'distance': 123,
        'formattedAddress': ['57 rue de Bretagne', '75003 Paris', 'France'],
        'labeledLatLngs': [{'label': 'display',
          'lat': 48.86391016055883,
          'lng'

### Define a function that extracts the category of the venue

In [0]:
# define a function that extracts the category of the venue

def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']
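To see how the function behaves, it can be applied to a few hand-made rows. The sample dictionaries below are assumptions that mimic the two key layouts the function accepts, and the snippet carries its own copy of the function, written here with `except KeyError`, so that it runs on its own.

```python
def get_category_type(row):
    """Return the first category name of a venue row, or None if it has none."""
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']

    if len(categories_list) == 0:
        return None
    return categories_list[0]['name']

# Hypothetical rows covering the three cases the function handles.
row_plain = {'categories': [{'name': 'Sandwich Place'}]}
row_nested = {'venue.categories': [{'name': 'Cheese Shop'}]}
row_empty = {'venue.categories': []}

for row in (row_plain, row_nested, row_empty):
    print(get_category_type(row))
```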

In [49]:
# clean the json and structure it into a pandas dataframe.

venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head(20)

| | name | categories | lat | lng |
|---|---|---|---|---|
| 0 | Mmmozza | Sandwich Place | 48.86391 | 2.360591 |
| 1 | Chez Alain Miam Miam | Sandwich Place | 48.862369 | 2.36195 |
| 2 | Fromagerie Jouannault | Cheese Shop | 48.862947 | 2.36253 |
| 3 | Square du Temple | Park | 48.864475 | 2.360816 |
| 4 | Marché des Enfants Rouges | Farmers Market | 48.862806 | 2.361996 |
| 5 | Chez Alain Miam Miam | Sandwich Place | 48.862781 | 2.362064 |
| 6 | Okomusu | Okonomiyaki Restaurant | 48.861453 | 2.360879 |
| 7 | Le Burger Fermier des Enfants Rouges | Burger Joint | 48.862831 | 2.362073 |
| 8 | Hôtel Jules & Jim | Hotel | 48.863496 | 2.357395 |
| 9 | SoMa | Japanese Restaurant | 48.861511 | 2.362146 |
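The `distance` field in the response and the 500 m `radius` both refer to distance from the query point. As a rough cross-check, that distance can be estimated from the coordinates with the haversine formula. This sketch assumes a spherical Earth with a mean radius of 6,371 km; the function name `haversine_m` is introduced for illustration, and the result only approximates whatever Foursquare computes internally.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres (spherical approximation)

def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in metres between two (lat, lng) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlmb = radians(lng2 - lng1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))

# Centre of the 3eme Ardt and the first venue in the table above.
centre = (48.86287238, 2.3600009859999997)
venue = (48.86391, 2.360591)

d = haversine_m(*centre, *venue)
print(round(d), d <= 500)
```

For the first venue the estimate should land close to the `'distance': 123` reported in the response, well inside the 500 m radius.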


In [50]:
# Check how many venues there are in 3eme Ardt within a radius of 500 meters

print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

79 venues were returned by Foursquare.
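The same request can be repeated for every arrondissement to build one combined venue table. The sketch below is an assumption about how that step might look, not the notebook's actual code: `collect_venues` and the `fetch` callback are hypothetical names, and `fetch` stands in for a function that calls the Foursquare API and returns the list found at `results['response']['groups'][0]['items']`. A stub is used here so the example runs without network access.

```python
import pandas as pd

def collect_venues(districts, fetch):
    """Build one row per venue for every district.

    districts: DataFrame with French_Name, Latitude and Longitude columns.
    fetch: callable (lat, lng) -> list of Foursquare-style item dicts.
    """
    rows = []
    for name, lat, lng in zip(districts['French_Name'],
                              districts['Latitude'],
                              districts['Longitude']):
        for item in fetch(lat, lng):
            venue = item['venue']
            categories = venue.get('categories', [])
            rows.append({
                'Neighborhood': name,
                'Venue': venue['name'],
                'Category': categories[0]['name'] if categories else None,
                'Venue Latitude': venue['location']['lat'],
                'Venue Longitude': venue['location']['lng'],
            })
    return pd.DataFrame(rows)

def fake_fetch(lat, lng):
    """Stub that returns one made-up venue at the given point."""
    return [{'venue': {'name': 'Example Cafe',
                       'categories': [{'name': 'Café'}],
                       'location': {'lat': lat, 'lng': lng}}}]

districts = pd.DataFrame({
    'French_Name': ['3eme Ardt', '2eme Ardt'],
    'Latitude': [48.862872, 48.868279],
    'Longitude': [2.360001, 2.342803],
})

all_venues = collect_venues(districts, fake_fetch)
print(all_venues.groupby('Neighborhood').size())
```

Passing the fetch step in as a callback keeps the looping logic testable offline; a real callback would wrap the `requests.get(url).json()` call shown earlier.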
