This is the notebook for the final project of the IBM Data Science Capstone course. <br/>
The objective is to find an ideal location for a new bakery in Hauts-de-Seine, France.

## Data Acquisition

In [3]:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

!conda install -c conda-forge folium=0.5.0 --yes
import folium # map rendering library

print('Libraries imported.')

Solving environment: done

# All requested packages already installed.

Solving environment: done

# All requested packages already installed.

Libraries imported.


### First, we build our base dataframe by scraping Wikipedia with BeautifulSoup

In [4]:
# Download the Wikipedia page and parse it with BeautifulSoup

source = requests.get('https://fr.wikipedia.org/wiki/Hauts-de-Seine').text
soup = BeautifulSoup(source, 'html5lib')

In [5]:
# Fetch the postal code, name, resident number and income (stored in the 'Incoming' column) of each neighbourhood
# Remove special characters such as whitespace and the currency symbol from the resident number and income
# Convert the resident number and income to integers
# Build our base dataframe for analysis

postal_codes_list = []
neighbourhood_list = []
resident_number_list = []
incoming_list = []

import re

for postal_code in soup.find("center").table.tbody.find_all(align="center"):
    postal_codes_list.append(postal_code.text)
    
    name = postal_code.next_sibling.next_sibling
    neighbourhood_list.append(name.text)
    
    resident_number = name.next_sibling.next_sibling.next_sibling.next_sibling.next_sibling.next_sibling.next_sibling.next_sibling
    resident_number_list.append(int(re.sub(r"\s+", "", resident_number.text)))

    incoming = resident_number.next_sibling.next_sibling
    incoming_list.append(int(re.sub(r"\s+", "", incoming.text[:-2])))

df = pd.DataFrame(list(zip(postal_codes_list, neighbourhood_list, resident_number_list, incoming_list)), columns = ['Postal Code', 'Neighbourhood', 'Resident Number', 'Incoming'])
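
The cleaning step above strips whitespace with a regular expression and cuts the currency suffix by slicing. As a minimal sketch of an alternative, the hypothetical helper `to_int` below drops every non-digit character in one pass. The sample strings are assumed cell formats written for illustration, not text copied from the Wikipedia page.

```python
import re


def to_int(text):
    """Drop every non-digit character and convert what is left to an integer."""
    digits = re.sub(r"\D", "", text)
    return int(digits)


# Assumed formats: a non-breaking space as thousands separator, a trailing euro sign
print(to_int("61\u00a0711"))
print(to_int("43 464 €"))
```

This only suits non-negative whole numbers: a footnote marker or a decimal part in the cell would be merged into the result, so the slicing approach used above may still be the safer choice for some tables.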

In [6]:
# Check the dataframe
df.head()

Unnamed: 0,Postal Code,Neighbourhood,Resident Number,Incoming
0,92002,Antony,61711,43464
1,92004,Asnières-sur-Seine,86512,33939
2,92007,Bagneux,39487,28286
3,92009,Bois-Colombes,28043,37353
4,92012,Boulogne-Billancourt,117931,40416


In [7]:
# Check the size
df.shape

(36, 4)

In [8]:
# Check the dtypes
df.dtypes

Postal Code        object
Neighbourhood      object
Resident Number     int64
Incoming            int64
dtype: object

### Then, we use GeoPy to convert the name of each neighbourhood to coordinates

In [9]:
# Use GeoPy to convert the address of each neighbourhood to Coordinates (Latitude and Longitude)
# Append the Latitude and Longitude to the base dataframe

from geopy.geocoders import Nominatim
geolocator = Nominatim(user_agent="Haut-de-Seine")

df['Coordinates']= df['Neighbourhood'].apply(geolocator.geocode).apply(lambda x: (x.latitude, x.longitude))
df[['Latitude', 'Longitude']] = df['Coordinates'].apply(pd.Series)
df.drop(['Coordinates'], axis=1, inplace=True)

In [10]:
# Check the dataframe
df.head()

Unnamed: 0,Postal Code,Neighbourhood,Resident Number,Incoming,Latitude,Longitude
0,92002,Antony,61711,43464,48.753554,2.295942
1,92004,Asnières-sur-Seine,86512,33939,48.910595,2.289045
2,92007,Bagneux,39487,28286,47.240726,-0.09905
3,92009,Bois-Colombes,28043,37353,48.914827,2.267489
4,92012,Boulogne-Billancourt,117931,40416,48.835665,2.240206
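
In the output above, the coordinates returned for Bagneux (47.24, -0.10) lie far from the other rows, which all sit near latitude 48.8 and longitude 2.2. This suggests the geocoder matched another place with the same name. The sketch below shows one possible safeguard: qualify the query with the department and country, and check the result against a rough bounding box. The helper names and the box limits are assumptions made for illustration, not official boundaries.

```python
# Rough bounding box for Hauts-de-Seine (approximate, assumed values)
LAT_MIN, LAT_MAX = 48.72, 48.96
LON_MIN, LON_MAX = 2.14, 2.34


def qualified_query(name, department="Hauts-de-Seine", country="France"):
    """Build a geocoding query that is less likely to match a homonym."""
    return "{}, {}, {}".format(name, department, country)


def inside_department(latitude, longitude):
    """Return True when the point falls inside the rough bounding box."""
    return LAT_MIN <= latitude <= LAT_MAX and LON_MIN <= longitude <= LON_MAX


print(qualified_query("Bagneux"))
print(inside_department(48.753554, 2.295942))  # Antony, from the table above
print(inside_department(47.240726, -0.09905))  # Bagneux, from the table above
```

The qualified string could be passed to `geolocator.geocode` in place of the bare name, and rows failing the box check could be reviewed by hand.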


In [11]:
# The code was removed by Watson Studio for sharing.

In [12]:
# Save dataframe as csv file to storage
project.save_data(data=df.to_csv(index=False),file_name='Haut-de-Seine.csv',overwrite=True)

{'asset_id': '51ae10f2-8215-498b-a637-35a3a3fab1f0',
 'bucket_name': 'capstonefinalproject-donotdelete-pr-03ohmlsfqcyk9h',
 'file_name': 'Haut-de-Seine.csv',
 'message': 'File saved to project storage.'}

### With the generated dataframe, we can pre-select our candidate neighbourhoods based on resident number and income

In [13]:
# Find neighbourhoods whose resident number and income are both above the mean
resident_number_mean = df['Resident Number'].mean()
incoming_mean = df['Incoming'].mean()

selected_df = df[(df['Resident Number'] > resident_number_mean) & (df['Incoming'] > incoming_mean)]

In [14]:
selected_df

Unnamed: 0,Postal Code,Neighbourhood,Resident Number,Incoming,Latitude,Longitude
0,92002,Antony,61711,43464,48.753554,2.295942
25,92060,Neuilly-sur-Seine,60910,57830,48.884683,2.269566
27,92063,Rueil-Malmaison,78794,44787,48.87778,2.180283


Great, we have our pre-selected candidates: Antony, Neuilly-sur-Seine and Rueil-Malmaison.<br/>
The next step is to use the Foursquare API to explore these neighbourhoods.

In [15]:
# The code was removed by Watson Studio for sharing.

In [16]:
# Function to get the nearby venues of a given neighbourhood
LIMIT = 100  # maximum number of venues per request

def getNearbyVenues(name, latitude, longitude, radius=500, categoryIds=''):
    columns = ['Localidad',
               'Localidad Latitude',
               'Localidad Longitude',
               'Venue',
               'Venue Latitude',
               'Venue Longitude',
               'Venue Category']
    venues_list = []

    # create the API request URL
    url = 'https://api.foursquare.com/v2/venues/search?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, VERSION, latitude, longitude, radius, LIMIT)
    if categoryIds != '':
        url = url + '&categoryId={}'.format(categoryIds)

    try:
        # make the GET request
        response = requests.get(url).json()
        results = response['response']['venues']
    except (requests.RequestException, ValueError, KeyError) as error:
        # report the failure and fall back to an empty result
        print('Foursquare request failed for {}: {!r}'.format(name, error))
        results = []

    # keep only the relevant information for each nearby venue
    for v in results:
        # skip venues that have no category
        if not v.get('categories'):
            continue
        venues_list.append((
            name,
            latitude,
            longitude,
            v['name'],
            v['location']['lat'],
            v['location']['lng'],
            v['categories'][0]['name']
        ))

    # explicit columns keep the dataframe usable even when no venue is found
    return pd.DataFrame(venues_list, columns=columns)
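
The filtering step inside `getNearbyVenues` (skip venues without a category, keep the name, position and first category) can be tried offline. The sketch below is only an illustration: `extract_venue_rows` is a hypothetical helper, and `sample_venues` is a hand-written payload shaped like the `response['response']['venues']` list the function reads, not real Foursquare data.

```python
def extract_venue_rows(name, latitude, longitude, venues):
    """Turn a list of venue dictionaries into flat tuples, skipping uncategorised venues."""
    rows = []
    for venue in venues:
        categories = venue.get('categories') or []
        if not categories:
            continue  # no category: the venue cannot be classified
        rows.append((
            name,
            latitude,
            longitude,
            venue['name'],
            venue['location']['lat'],
            venue['location']['lng'],
            categories[0]['name'],
        ))
    return rows


# Hypothetical payload, written by hand for this example
sample_venues = [
    {'name': 'Boulangerie Exemple',
     'location': {'lat': 48.7540, 'lng': 2.2960},
     'categories': [{'name': 'Bakery'}]},
    {'name': 'Lieu sans catégorie',
     'location': {'lat': 48.7550, 'lng': 2.2970},
     'categories': []},
]

rows = extract_venue_rows('Antony', 48.753554, 2.295942, sample_venues)
print(rows)
```

Checking for an empty category list explicitly, rather than catching every exception, keeps unrelated errors visible.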

In [17]:
# Function to add markers for given venues to map
def addToMap(df, color, existingMap):
    for lat, lng, local, venue, venueCat in zip(df['Venue Latitude'], df['Venue Longitude'], df['Localidad'], df['Venue'], df['Venue Category']):
        label = '{} ({}) - {}'.format(venue, venueCat, local)
        label = folium.Popup(label, parse_html=True)
        folium.CircleMarker(
            [lat, lng],
            radius=5,
            popup=label,
            color=color,
            fill=True,
            fill_color=color,
            fill_opacity=0.7).add_to(existingMap)

In [18]:
neighbourhood_map_dict = {}

In [19]:
# Function to generate a map with Folium of the given neighbourhood
def generateNeighbourhoodMap(name):
    idx = selected_df[selected_df['Neighbourhood']==name].index[0]
    latitude = selected_df.loc[idx, 'Latitude']
    longitude = selected_df.loc[idx, 'Longitude']
    
    neighbourhood_map_dict[name] = folium.Map(location=[latitude, longitude], zoom_start=15)

In [20]:
# Function to retrieve the bakery, office and school venues of a given neighbourhood and mark them on the corresponding map.
# It also fills the dataframe with the number of venues of each kind
def exploreNeighbourhood(name):
    idx = selected_df[selected_df['Neighbourhood']==name].index[0]
    latitude = selected_df.loc[idx, 'Latitude']
    longitude = selected_df.loc[idx, 'Longitude']

    venues_bakery = getNearbyVenues(name, latitude=latitude, longitude=longitude, radius=1000, categoryIds='4bf58dd8d48988d16a941735')
    venues_office = getNearbyVenues(name, latitude=latitude, longitude=longitude, radius=1000, categoryIds='4d4b7105d754a06375d81259')
    venues_school = getNearbyVenues(name, latitude=latitude, longitude=longitude, radius=1000, categoryIds='4d4b7105d754a06372d81259')
    
    bakery_number = venues_bakery.shape[0]
    office_number = venues_office.shape[0]
    school_number = venues_school.shape[0]

    selected_df.loc[idx,'Bakery Number'] = bakery_number
    selected_df.loc[idx,'Office Number'] = office_number
    selected_df.loc[idx,'School Number'] = school_number
    
    # update map
    addToMap(venues_bakery, 'red', neighbourhood_map_dict[name])
    addToMap(venues_office, 'green', neighbourhood_map_dict[name])
    addToMap(venues_school, 'blue', neighbourhood_map_dict[name])
    
    legend_html =   '''
                <div style="position: fixed; 
                            bottom: 50px; left: 50px; width: 100px; height: 120px; 
                            border:2px solid grey; z-index:9999; font-size:14px;
                            ">&nbsp; Cool Legend <br>
                              &nbsp; Bakery &nbsp; <i class="fa fa-map-marker fa-2x" style="color:red"></i><br>
                              &nbsp; School &nbsp; <i class="fa fa-map-marker fa-2x" style="color:blue"></i><br>
                              &nbsp; Office &nbsp; <i class="fa fa-map-marker fa-2x" style="color:green"></i><br>
                </div>
                ''' 

    neighbourhood_map_dict[name].get_root().html.add_child(folium.Element(legend_html))

### We will explore the neighbourhood Antony:
1. Build a map centred on Antony using Folium <br/>
2. Fetch the bakery, office and school venues using the Foursquare API<br/>
3. Count the listed venues and add the counts to the dataframe of selected neighbourhoods<br/>
4. Mark the listed venues on the map

In [21]:
# Explore the neighbourhood Antony
name='Antony'
generateNeighbourhoodMap(name)
exploreNeighbourhood(name)
neighbourhood_map_dict[name]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[key] = _infer_fill_value(value)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s
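
The message above is the text of pandas' `SettingWithCopyWarning`. It typically appears when a column is assigned on a dataframe that was obtained by filtering another one, as `selected_df` was. A common way to avoid it, sketched below on a small frame built from values shown earlier in this notebook, is to take an explicit copy right after filtering. The threshold of 50000 and the variable names are chosen for illustration only.

```python
import pandas as pd

# Three rows copied from the tables shown earlier in this notebook
df = pd.DataFrame({
    'Neighbourhood': ['Antony', 'Bagneux', 'Rueil-Malmaison'],
    'Resident Number': [61711, 39487, 78794],
})

# An explicit copy makes the filtered result an independent dataframe
selected = df[df['Resident Number'] > 50000].copy()

# Adding a column to the copy is unambiguous and leaves df untouched
selected.loc[selected['Neighbourhood'] == 'Antony', 'Bakery Number'] = 10

print(selected)
```

Whether the warning is emitted at all depends on the pandas version in use, so this is a precaution rather than a required change.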


### We will explore the neighbourhood Neuilly-sur-Seine:
1. Build a map centred on Neuilly-sur-Seine using Folium <br/>
2. Fetch the bakery, office and school venues using the Foursquare API<br/>
3. Count the listed venues and add the counts to the dataframe of selected neighbourhoods<br/>
4. Mark the listed venues on the map

In [23]:
# Explore neighbourhood Neuilly-sur-Seine
name='Neuilly-sur-Seine'
generateNeighbourhoodMap(name)
exploreNeighbourhood(name)
neighbourhood_map_dict[name]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s


### We will explore the neighbourhood Rueil-Malmaison:
1. Build a map centred on Rueil-Malmaison using Folium <br/>
2. Fetch the bakery, office and school venues using the Foursquare API<br/>
3. Count the listed venues and add the counts to the dataframe of selected neighbourhoods<br/>
4. Mark the listed venues on the map

In [24]:
# Explore neighbourhood Rueil-Malmaison
name='Rueil-Malmaison'
generateNeighbourhoodMap(name)
exploreNeighbourhood(name)
neighbourhood_map_dict[name]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s


In [25]:
# Display the dataframe with the number of the Office venues, School venues and Bakery venues
selected_df.head()

Unnamed: 0,Postal Code,Neighbourhood,Resident Number,Incoming,Latitude,Longitude,Bakery Number,Office Number,School Number
0,92002,Antony,61711,43464,48.753554,2.295942,10.0,44.0,33.0
25,92060,Neuilly-sur-Seine,60910,57830,48.884683,2.269566,48.0,50.0,28.0
27,92063,Rueil-Malmaison,78794,44787,48.87778,2.180283,5.0,40.0,10.0


## One can see that Rueil-Malmaison is a good choice for our bakery location:
it has the largest number of residents of the three candidates but very few bakeries. <br/>
Even though it has fewer offices and schools than the other two, it could be interesting to open a new bakery in this neighbourhood. <br/>
We now check this assumption with a scoring model.

### We are now constructing our scoring model to confirm our assumption.
The inputs are the numbers of residents, bakery venues, school venues and office venues.<br/>
We compute the weighted sum for each neighbourhood and look for the highest score.

In [26]:
# Negative weight, as we want to avoid competitors
weight_bakery = -1

# School students could be our customers
weight_schools = 1

# Employees could be better customers, as they would spend on a lunch menu
weight_offices = 2

# Residents are also quite good customers, but we should consider that a household is made up of about 3 people
weight_residents = 2/3


In [27]:
# Calculate the weighted score and add the score to our dataframe
selected_df['Score'] = (
    selected_df['Bakery Number'] * weight_bakery
    + selected_df['School Number'] * weight_schools
    + selected_df['Office Number'] * weight_offices
    + selected_df['Resident Number'] * weight_residents
)
selected_df = selected_df.sort_values(by=['Score'], ascending=False)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app


In [28]:
selected_df

Unnamed: 0,Postal Code,Neighbourhood,Resident Number,Incoming,Latitude,Longitude,Bakery Number,Office Number,School Number,Score
27,92063,Rueil-Malmaison,78794,44787,48.87778,2.180283,5.0,40.0,10.0,52614.333333
0,92002,Antony,61711,43464,48.753554,2.295942,10.0,44.0,33.0,41251.666667
25,92060,Neuilly-sur-Seine,60910,57830,48.884683,2.269566,48.0,50.0,28.0,40686.666667
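
The scores in this table can be reproduced without any API call. The sketch below recomputes the weighted sum from the counts shown above and also reports how much of each score comes from the resident term. The dictionary layout and the names `weighted_score` and `resident_share` are illustrative choices, not part of the notebook's code.

```python
# Counts and resident numbers copied from the table above
candidates = {
    'Antony': {'residents': 61711, 'bakeries': 10, 'offices': 44, 'schools': 33},
    'Neuilly-sur-Seine': {'residents': 60910, 'bakeries': 48, 'offices': 50, 'schools': 28},
    'Rueil-Malmaison': {'residents': 78794, 'bakeries': 5, 'offices': 40, 'schools': 10},
}

# Same weights as in the notebook
WEIGHTS = {'bakeries': -1, 'schools': 1, 'offices': 2, 'residents': 2 / 3}


def weighted_score(row, weights=WEIGHTS):
    """Sum each input multiplied by its weight."""
    return sum(row[key] * weight for key, weight in weights.items())


scores = {name: weighted_score(row) for name, row in candidates.items()}
ranking = sorted(scores, key=scores.get, reverse=True)

# Fraction of each score that comes from the resident term alone
resident_share = {
    name: candidates[name]['residents'] * WEIGHTS['residents'] / scores[name]
    for name in candidates
}

for name in ranking:
    print(name, round(scores[name], 2), round(resident_share[name], 4))
```

With these weights the resident term makes up more than 99% of every score, so the ranking matches the order of the resident numbers. Rescaling the inputs to comparable ranges before weighting would be one way to let the venue counts influence the result.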


## Great! The model reaches the same conclusion as our prediction: Rueil-Malmaison is the best choice for our new bakery.