# The Battle of Neighborhoods

This is the Capstone Project of the IBM Data Science Professional Certificate on Coursera.org. Link to the full courses: https://www.coursera.org/professional-certificates/ibm-data-science

__Author: Dam Quang Trung__

__Date of Publication: April, 2020__

*Alternative link (executable Folium map): https://nbviewer.jupyter.org/github/dqtvictory/Coursera_Capstone/blob/master/The%20Final%20Battle.ipynb*

### Background

Paris, France is where I live. This city is remarkably dynamic and diversified with lots of ethnic groups as well as religions and income classes. With millions of habitants within the city, plus over 3 times more of that in the suburbs, health care is one of the most important services in people's everyday life. France is the nation with top-notched health care system providing low-cost service of very high quality to its vast majority of population, where the Social Security takes care of 70% of nearly all medical-related fees, therefore health care is almost available to everybody.

However, due to the popularity of public health care, it is sometimes not very easy to get a next-day appointment with the doctor, or have one's wisdom tooth extracted within the week when it starts to cause pain. Hospitals can sometimes be filled with patients from all working classes that wealthier patients may desire more privacy they don't have. Thus, private medical service providers can tailor to the needs of these patients, by supplying beds and chambers with greater privacy and utilities, better meals and personal services, access to top-of-the-line equipments and medicine, and so on. It's worth noting that the national Social Security only pays for 70% of ***conventionned*** medical-related fees, meaning for example a doctor's consultation is conventionned at 23.00 euros, therefore one is reimbursed _23.00 * 0.7 = 16.10_ for each consultation. Of course this leaves 6.90 euros left to be paid for the conventionned price, plus more if the doctor sets his or her price higher.

### Project Objectives

This project seeks to explore the city of Paris, the capital of France, and its neighborhoods to find the best location(s) to start a business in providing high-end medical services. In other words, I would like to open a private clinic in Paris for patients with high income, where should I look in Paris?


### Who is this project for?

This project can be utilized by entrepreneurs in the medical care industry as a reference. The codes in this project notebook are open-sourced and free to be used elsewhere.


### Methodologies

The following methodologies are used in this project in order to achieve the above objectives:
+ Python programming language
+ Web scraping tool: BeautifulSoup
+ Machine learning techniques: K-mean clustering, Regression...
+ API from Foursquare, Google Maps
+ Python libraries: Numpy, Pandas, Matplotlib, Folium...


### Data Source

I am going to need the following data:
+ List of Paris's neighborhoods and population in 2016 from the website of the Nation Institute of Statistics and Economics Studies of France (INSEE). Link: https://www.insee.fr/fr/statistiques/4228434#consulter
+ Population's average income in 2006 from https://www.salairemoyen.com/
+ Geodata from Google Maps' API
+ Neighborhood exploration from Foursquare's Places API

In [1]:
import numpy as np
import pandas as pd

In [2]:
filename = 'base-ic-evol-struct-pop-2016.xls'

df = pd.read_excel(filename, skiprows=5)
df.head()

Unnamed: 0,IRIS,REG,DEP,UU2010,COM,LIBCOM,TRIRIS,GRD_QUART,LIBIRIS,TYP_IRIS,...,C16_F15P_CS4,C16_F15P_CS5,C16_F15P_CS6,C16_F15P_CS7,C16_F15P_CS8,P16_POP_FR,P16_POP_ETR,P16_POP_IMM,P16_PMEN,P16_PHORMEN
0,10010000,84,1,1000,1001,L'Abergement-Clémenciat,ZZZZZZ,100100,L'Abergement-Clémenciat (commune non irisée),Z,...,50.0,85.0,30.0,70.0,15.0,759.0,8.0,20.0,767.0,0.0
1,10020000,84,1,1000,1002,L'Abergement-de-Varey,ZZZZZZ,100200,L'Abergement-de-Varey (commune non irisée),Z,...,20.0,25.0,0.0,20.0,0.0,241.0,2.0,3.0,243.0,0.0
2,10040101,84,1,1302,1004,Ambérieu-en-Bugey,ZZZZZZ,100401,Les Perouses-Triangle d'Activité,H,...,99.457191,158.028018,59.452099,216.381754,195.032628,1614.595359,278.516847,308.098676,1557.106066,336.006141
3,10040102,84,1,1302,1004,Ambérieu-en-Bugey,ZZZZZZ,100401,Longeray-Gare,H,...,220.097097,354.962018,158.716105,351.399678,346.514489,3115.756603,430.453561,515.488156,3546.210164,0.0
4,10040201,84,1,1302,1004,Ambérieu-en-Bugey,ZZZZZZ,100402,Centre-St Germain-Vareilles,H,...,236.966388,407.311067,147.066671,529.37121,334.26271,3831.463866,268.539722,341.362897,4012.00198,88.001608


#### I'm actually not needing all of the columns as well as all rows, since this dataset contains the data of the entire French population in 2016. So let's trim down the data table.

In [3]:
df = df[df['DEP']=='75']  # Grabs all rows of Paris which belongs to the department 75
columns_df = ['LIBCOM','LIBIRIS','P16_POP']

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

df = df[columns_df]
df.head()

Unnamed: 0,LIBCOM,LIBIRIS,P16_POP
38042,Paris 1er Arrondissement,Saint-Germain l'Auxerrois 1,976.094011
38043,Paris 1er Arrondissement,Saint-Germain l'Auxerrois 2,185.216544
38044,Paris 1er Arrondissement,Saint-Germain l'Auxerrois 3,237.057352
38045,Paris 1er Arrondissement,Saint-Germain l'Auxerrois 4,2.999943
38046,Paris 1er Arrondissement,Tuileries,0.0


In [4]:
# Let's rename the columns so that it's more readable
new_names = ['District', 'Neighborhood', 'Population']
df.rename(columns={columns_df[i]:new_names[i] for i in range(3)}, inplace=True)

# And check the data table's bottom to make sure that all of Paris is loaded correctly
df.tail()

Unnamed: 0,District,Neighborhood,Population
39029,Paris 20e Arrondissement,Charonne 22,1780.140904
39030,Paris 20e Arrondissement,Charonne 23,2117.432119
39031,Paris 20e Arrondissement,Charonne 24,3068.788333
39032,Paris 20e Arrondissement,Charonne 25,2767.341351
39033,Paris 20e Arrondissement,Charonne 26,1849.224591


In [5]:
print("Data's shape before cleaning:", df.shape)

Data's shape before cleaning: (992, 3)


#### We also notice the following things:
+ The same neighborhoods are divided into many rows which is redundant.
+ Population data is float type and has decimal numbers which needs to be converted to integer
+ Some neighborhoods have zero population, which we aren't interested in

#### Let's take care of the above points to clean our data

In [6]:
# Remove zero population districts
paris_df = df.replace(0.0, np.nan)
paris_df.dropna(inplace=True)

paris_df.head()

Unnamed: 0,District,Neighborhood,Population
38042,Paris 1er Arrondissement,Saint-Germain l'Auxerrois 1,976.094011
38043,Paris 1er Arrondissement,Saint-Germain l'Auxerrois 2,185.216544
38044,Paris 1er Arrondissement,Saint-Germain l'Auxerrois 3,237.057352
38045,Paris 1er Arrondissement,Saint-Germain l'Auxerrois 4,2.999943
38048,Paris 1er Arrondissement,Les Halles 1,2022.249227


In [7]:
# Combine rows to unify data from the same neighborhood

# First we iterate each row in our table and find the row's neighborhood's name. Almost all names in the neighborhood column end with a
# number, so we can easily grab the name by string slicing str[:-2] to exclude the number and a space. Then if our dataset is perfect,
# the few following rows should have the same current neighborhood, until the next one then we stop. We combine the iterated rows by
# summing the total neighborhood's population to the first row, then we drop the rest of the rows from the 2nd row. Repeat the process
# until the last possible neighborhood.

paris_df.reset_index(inplace=True, drop=True)

i = paris_df.index[0]  # First pointer

while i <= paris_df.index[-1]:
    neighborhood_name = paris_df.loc[i, 'Neighborhood'][:-2]
    paris_df.loc[i, 'Neighborhood'] = neighborhood_name
    sum_pop = paris_df.loc[i, 'Population']
    j = i + 1  # Second pointer
    
    if i < paris_df.index[-1]:   
        while j <= paris_df.index[-1] and paris_df.loc[j, 'Neighborhood'].startswith(neighborhood_name):
            sum_pop += paris_df.loc[j, 'Population']
            paris_df.drop(j, axis=0, inplace=True)
            j += 1
    
    paris_df.loc[i, 'Population'] = sum_pop
    i = j
    
paris_df

Unnamed: 0,District,Neighborhood,Population
0,Paris 1er Arrondissement,Saint-Germain l'Auxerrois,1401.367849
4,Paris 1er Arrondissement,Les Halles,8868.773185
9,Paris 1er Arrondissement,Palais Royal,3239.424017
12,Paris 1er Arrondissement,Place Vendome,2742.434949
14,Paris 2e Arrondissement,Gaillon,1456.111387
17,Paris 2e Arrondissement,Vivienne,3046.086964
19,Paris 2e Arrondissement,Mail,6383.112086
23,Paris 2e Arrondissement,Bonne Nouvelle,9374.689562
28,Paris 3e Arrondissement,Arts et Metiers,9722.069984
33,Paris 3e Arrondissement,Enfants Rouges,8996.198418


In [8]:
# Since there are some "neigborhoods" that have fewer than 1000 habitants, we should not consider these real. In fact,
# if we try looking up one of these rows on Google Maps for example, like "Bois de Boulogne", we see that this is
# actually a very large park to the west of Paris, hence impossible to be considered a valid neighborhood. Let's grab
# only rows that have population of more than 1000.

paris_df = paris_df[paris_df['Population'] >= 1000]

# Finally, convert the population data into correct dtype which is integer

paris_df['Population'] = paris_df['Population'].apply(round)
paris_df.reset_index(drop=True, inplace=True)
paris_df

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  paris_df['Population'] = paris_df['Population'].apply(round)


Unnamed: 0,District,Neighborhood,Population
0,Paris 1er Arrondissement,Saint-Germain l'Auxerrois,1401
1,Paris 1er Arrondissement,Les Halles,8869
2,Paris 1er Arrondissement,Palais Royal,3239
3,Paris 1er Arrondissement,Place Vendome,2742
4,Paris 2e Arrondissement,Gaillon,1456
5,Paris 2e Arrondissement,Vivienne,3046
6,Paris 2e Arrondissement,Mail,6383
7,Paris 2e Arrondissement,Bonne Nouvelle,9375
8,Paris 3e Arrondissement,Arts et Metiers,9722
9,Paris 3e Arrondissement,Enfants Rouges,8996


In [9]:
paris_df.dtypes

District        object
Neighborhood    object
Population       int64
dtype: object

#### Let's save the first cleaned data so that we don't have to reload the large Excel file again each time we start this notebook

In [10]:
paris_df.to_csv('Paris neighborhoods.csv', index=False)

# Battle of Neighborhoods (cont'd)

#### Let's continue our data journey by loading up the processed data table in the CSV file

In [1]:
import numpy as np
import pandas as pd

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

filename = 'Paris neighborhoods.csv'

paris_df = pd.read_csv(filename)
paris_df.head()

Unnamed: 0,District,Neighborhood,Population
0,Paris 1er Arrondissement,Saint-Germain l'Auxerrois,1401
1,Paris 1er Arrondissement,Les Halles,8869
2,Paris 1er Arrondissement,Palais Royal,3239
3,Paris 1er Arrondissement,Place Vendome,2742
4,Paris 2e Arrondissement,Gaillon,1456


#### Expand the table with the corresponding coordinates of each neighborhood, directly from Google Maps' API

In [2]:
import requests

API_key = 'AIzaSyBm0pIjblcVL6P_9qpxO81mxMx2iodbOWA'  # My OLD Google Maps' API, please don't bother try using this key

for i in paris_df.index:
    name = paris_df.loc[i, 'Neighborhood'].replace(' ', '+')  # Google Maps doesn't like space in its request URL, instead a plus sign
    url = f'https://maps.googleapis.com/maps/api/geocode/json?address=Quartier+{name},+Paris&key={API_key}'
    results = requests.get(url).json()
    lat = results['results'][0]['geometry']['location']['lat']
    lng = results['results'][0]['geometry']['location']['lng']
    paris_df.loc[i, 'Latitude'] = lat
    paris_df.loc[i, 'Longitude'] = lng
    
paris_df.head()

Unnamed: 0,District,Neighborhood,Population,Latitude,Longitude
0,Paris 1er Arrondissement,Saint-Germain l'Auxerrois,1401,48.861562,2.333719
1,Paris 1er Arrondissement,Les Halles,8869,48.862335,2.344736
2,Paris 1er Arrondissement,Palais Royal,3239,48.865221,2.335364
3,Paris 1er Arrondissement,Place Vendome,2742,48.867447,2.329434
4,Paris 2e Arrondissement,Gaillon,1456,48.869662,2.333622


In [3]:
# Let's visualize all of Parisian neighborhoods on the map

import folium
import matplotlib.cm as cm
import matplotlib.colors as colors

# Get Paris's latitude and longitude
url = f'https://maps.googleapis.com/maps/api/geocode/json?address=Paris,+France&key={API_key}'
results = requests.get(url).json()
lat_paris = results['results'][0]['geometry']['location']['lat']
lng_paris = results['results'][0]['geometry']['location']['lng']

# Generate a map using Folium
map_paris = folium.Map(location=[lat_paris, lng_paris], zoom_start=11, tiles='Stamen Toner')

# Set color scheme for each district
paris_districts = np.unique(paris_df['District'].values)
num_districts = len(paris_districts)

x = np.arange(num_districts)
ys = [i + x + (i*x)**2 for i in range(num_districts)]
colors_array = cm.rainbow(np.linspace(0, 1, num_districts))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# Add markers to the map
for lat, lon, district, neigh in zip(paris_df['Latitude'], paris_df['Longitude'], paris_df['District'], paris_df['Neighborhood']):
    label = folium.Popup(f"{neigh}, {district[:-15]}", parse_html=True)
    color = rainbow[np.where(paris_districts==district)[0][0]]
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=color,
        fill=True,
        fill_color=color,
        fill_opacity=0.7).add_to(map_paris)
       
map_paris

#### But wait! There is one odd point to the city's south!

Apparently Google has mistaken when we searched for "Quartier Roquette, Paris" just now. Let's be more specific and search for ___Quartier de la Roquette, Paris 11e___ instead.

In [4]:
keyword = 'Quartier+de+la+Roquette,+Paris+11e'
url = f'https://maps.googleapis.com/maps/api/geocode/json?address={keyword}&key={API_key}'

results = requests.get(url).json()
lat_roquette = results['results'][0]['geometry']['location']['lat']
lng_roquette = results['results'][0]['geometry']['location']['lng']

print("New coordinates: (", lat_roquette, ",", lng_roquette, ")")

New coordinates: ( 48.8578217 , 2.3801945 )


In [5]:
# Locate the correct row in the data table

paris_df[paris_df['Neighborhood'] == 'Roquette']

Unnamed: 0,District,Neighborhood,Population,Latitude,Longitude
42,Paris 11e Arrondissement,Roquette,46429,48.804372,2.381166


In [6]:
# Replace with the right coordinates
paris_df.loc[42, ['Latitude', 'Longitude']] = [lat_roquette, lng_roquette]

# Let's try again
map_paris = folium.Map(location=[lat_paris, lng_paris], zoom_start=12, tiles='Stamen Toner')

for lat, lon, district, neigh in zip(paris_df['Latitude'], paris_df['Longitude'], paris_df['District'], paris_df['Neighborhood']):
    label = folium.Popup(f"{neigh}, {district[:-15]}", parse_html=True)
    color = rainbow[np.where(paris_districts==district)[0][0]]
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=color,
        fill=True,
        fill_color=color,
        fill_opacity=0.7).add_to(map_paris)
       
map_paris

In [7]:
# Save the coordinates to a CSV file so we don't have to call the Google Maps' API again

paris_df.to_csv('Paris neighborhoods lat lng.csv', index=False)

#### We have just geo-identified all candidates for our future clinic. Now it's time to scrape the web to find the income data of the population

- Link to the main page: https://www.salairemoyen.com/en/departement-75-Paris.html
- Link to the 1st district's data: https://www.salairemoyen.com/en/salaire-ville-75101-Paris_1er_Arrondissement.html which has a pattern like the rest of the districts

We are interested in the mean income of each neighborhood as well as that of the top 10% richest, which is potentially our future clients.

In [1]:
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup as bs
import requests

filename = 'Paris neighborhoods lat lng.csv'
paris_df = pd.read_csv(filename)

for i in range(0, 80, 4):
    district = paris_df.loc[i, "District"]
    district = district.replace(" ", "_")
    postal_code = 75101 + i//4
    
    url = f"https://www.salairemoyen.com/en/salaire-ville-{postal_code}-{district}.html"
    
    page = requests.get(url)
    soup = bs(page.text, 'html.parser')
    
    mean_income_text = soup.select('span')[11].get_text().replace(" ", "")
    mean_income = int(mean_income_text[:4])
    
    top_income_text = soup.select('span')[16].get_text()
    top_income = float(top_income_text[3:6]) * mean_income
    
    paris_df.loc[i:i+3, "Mean Income"] = mean_income
    
    paris_df.loc[i:i+3, "Top 10%'s Income"] = top_income

paris_df.head()

Unnamed: 0,District,Neighborhood,Population,Latitude,Longitude,Mean Income,Top 10%'s Income
0,Paris 1er Arrondissement,Saint-Germain l'Auxerrois,1401,48.861562,2.333719,2994.0,8682.6
1,Paris 1er Arrondissement,Les Halles,8869,48.862335,2.344736,2994.0,8682.6
2,Paris 1er Arrondissement,Palais Royal,3239,48.865221,2.335364,2994.0,8682.6
3,Paris 1er Arrondissement,Place Vendome,2742,48.867447,2.329434,2994.0,8682.6
4,Paris 2e Arrondissement,Gaillon,1456,48.869662,2.333622,2778.0,7500.6


In [2]:
# Computations in order to leverage each neighborhood's population strength: total neighborhood's income, and
# top 10%'s total income. The latter naively assumes that the riches are distributed evenly within a district.

paris_df["Population Income"] = paris_df['Mean Income'] * paris_df['Population']
paris_df["The Richest's Total Income"] = paris_df["Top 10%'s Income"] * paris_df['Population'] * 0.1

paris_df.head()

Unnamed: 0,District,Neighborhood,Population,Latitude,Longitude,Mean Income,Top 10%'s Income,Population Income,The Richest's Total Income
0,Paris 1er Arrondissement,Saint-Germain l'Auxerrois,1401,48.861562,2.333719,2994.0,8682.6,4194594.0,1216432.26
1,Paris 1er Arrondissement,Les Halles,8869,48.862335,2.344736,2994.0,8682.6,26553786.0,7700597.94
2,Paris 1er Arrondissement,Palais Royal,3239,48.865221,2.335364,2994.0,8682.6,9697566.0,2812294.14
3,Paris 1er Arrondissement,Place Vendome,2742,48.867447,2.329434,2994.0,8682.6,8209548.0,2380768.92
4,Paris 2e Arrondissement,Gaillon,1456,48.869662,2.333622,2778.0,7500.6,4044768.0,1092087.36


#### It's time to make use of the Foursquare's Places API. Let's find out how fierce the competition is in the market.

In [3]:
# Foursquare's API Credentials. I have changed these info by the time of publishing this notebook

CLIENT_ID = 'VXGDNSQYKF4LLMYU30PZCSSLWCPV1WJG4KX1WZRCUZWZIZ4G'
CLIENT_SECRET = 'LFMWODACB2BWMNZKF23J00ZC0SMNDW5ZBMGS1ELWDYTPZRGX'
VERSION = '20200330'

In [4]:
# Function that extracts the category of the venue

def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [6]:
LIMIT = 100
radius = 500
search_query = 'clinique'  # which stands for clinic in French

search_df = pd.DataFrame()

for i in range(80):
    lat = paris_df.loc[i, 'Latitude']
    lng = paris_df.loc[i, 'Longitude']
    neigh = paris_df.loc[i, 'Neighborhood']
    
    url = f'https://api.foursquare.com/v2/venues/search?client_id={CLIENT_ID}&client_secret={CLIENT_SECRET}&ll={lat},{lng}&v={VERSION}&query={search_query}&radius={radius}&limit={LIMIT}'
    results = requests.get(url).json()
    
    results_df = pd.json_normalize(results['response']['venues'])
    results_df['Neighborhood'] = np.array([neigh] * results_df.shape[0])
    
    if search_df.size == 0:
        search_df = results_df
    else:
        search_df = pd.concat([search_df, results_df])
    
search_df.head()

Unnamed: 0,id,name,categories,referralId,hasPerk,location.address,location.crossStreet,location.lat,location.lng,location.labeledLatLngs,location.distance,location.postalCode,location.cc,location.city,location.state,location.country,location.formattedAddress,Neighborhood,venuePage.id,location.neighborhood
0,4ddf37391f6ed1828c4fe872,Clinique du Louvre,"[{'id': '4bf58dd8d48988d196941735', 'name': 'H...",v-1585692229,False,17 rue des Prêtres Saint-Germain l'Auxerrois,Place du Louvre,48.859652,2.341171,"[{'label': 'display', 'lat': 48.85965208323426...",585.0,75001.0,FR,Paris,Île-de-France,France,[17 rue des Prêtres Saint-Germain l'Auxerrois ...,Saint-Germain l'Auxerrois,,
0,4d566932cff7721ebc47b5f5,Clinique Vétérinaire du Dr Frantz Cappé,"[{'id': '5032897c91d4c4b30a586d69', 'name': 'P...",v-1585692229,False,14 Rue Bertin Poirée,,48.858797,2.344811,"[{'label': 'display', 'lat': 48.8587974959729,...",393.0,75001.0,FR,Paris,Île-de-France,France,"[14 Rue Bertin Poirée, 75001 Paris, France]",Les Halles,150516921.0,
1,4babc96cf964a52071c93ae3,Clinique Vétérinaire du Docteur Gachet,"[{'id': '4bf58dd8d48988d104941735', 'name': 'M...",v-1585692229,False,32 Rue Etienne Marcel,,48.864092,2.348468,"[{'label': 'display', 'lat': 48.86409202068506...",336.0,75002.0,FR,Paris,Île-de-France,France,"[32 Rue Etienne Marcel, 75002 Paris, France]",Les Halles,,
2,4ddf37391f6ed1828c4fe872,Clinique du Louvre,"[{'id': '4bf58dd8d48988d196941735', 'name': 'H...",v-1585692229,False,17 rue des Prêtres Saint-Germain l'Auxerrois,Place du Louvre,48.859652,2.341171,"[{'label': 'display', 'lat': 48.85965208323426...",396.0,75001.0,FR,Paris,Île-de-France,France,[17 rue des Prêtres Saint-Germain l'Auxerrois ...,Les Halles,,
3,4c88ae740f3c236a0b8ef45c,Clinique Bachaumont,[],v-1585692229,False,,,48.865929,2.346585,"[{'label': 'display', 'lat': 48.86592859604439...",422.0,,FR,,,France,[France],Les Halles,,


In [7]:
# How many search results did we get?
search_df.shape

(237, 20)

In [8]:
# Now apply the fonction defined above to filter out each row's category name from the categories column

search_df['categories'] = search_df.apply(get_category_type, axis=1)
get_columns = ['name', 'categories', 'Neighborhood']
search_df = search_df[get_columns]

search_df.head()

Unnamed: 0,name,categories,Neighborhood
0,Clinique du Louvre,Hospital,Saint-Germain l'Auxerrois
0,Clinique Vétérinaire du Dr Frantz Cappé,Pet Service,Les Halles
1,Clinique Vétérinaire du Docteur Gachet,Medical Center,Les Halles
2,Clinique du Louvre,Hospital,Les Halles
3,Clinique Bachaumont,,Les Halles


In [9]:
# We observe that there are some categories that we don't really want like Pet Service. Let's see what and how many there are

search_df.reset_index(drop=True, inplace=True)
search_df.groupby(by='categories').count()

Unnamed: 0_level_0,name,Neighborhood
categories,Unnamed: 1_level_1,Unnamed: 2_level_1
Acupuncturist,1,1
Animal Shelter,5,5
Arts & Crafts Store,2,2
Assisted Living,1,1
Bike Shop,1,1
Clothing Store,1,1
Cosmetics Shop,3,3
Dentist's Office,11,11
Design Studio,1,1
Doctor's Office,16,16


In [10]:
# We are keeping the following categories as these are more relevant
categories_kept = ["Dentist's Office", "Doctor's Office	", "Hospital", "Medical Center", "Mental Health Office"]
rows_mask = list(map(lambda x: x in categories_kept, search_df['categories'].to_list()))
search_df = search_df[rows_mask]

print("New shape of our data after filtering out:", search_df.shape)

New shape of our data after filtering out: (80, 3)


In [11]:
# Now let's input the competition count into our initial data table

count_df = search_df.groupby(by='Neighborhood').count()

for i in range(80):
    neigh = paris_df.loc[i, 'Neighborhood']
    if neigh in count_df.index:
        paris_df.loc[i, 'Competition'] = count_df.loc[neigh, 'name']
    else:
        paris_df.loc[i, 'Competition'] = 0
        
paris_df.head()

Unnamed: 0,District,Neighborhood,Population,Latitude,Longitude,Mean Income,Top 10%'s Income,Population Income,The Richest's Total Income,Competition
0,Paris 1er Arrondissement,Saint-Germain l'Auxerrois,1401,48.861562,2.333719,2994.0,8682.6,4194594.0,1216432.26,1.0
1,Paris 1er Arrondissement,Les Halles,8869,48.862335,2.344736,2994.0,8682.6,26553786.0,7700597.94,2.0
2,Paris 1er Arrondissement,Palais Royal,3239,48.865221,2.335364,2994.0,8682.6,9697566.0,2812294.14,0.0
3,Paris 1er Arrondissement,Place Vendome,2742,48.867447,2.329434,2994.0,8682.6,8209548.0,2380768.92,0.0
4,Paris 2e Arrondissement,Gaillon,1456,48.869662,2.333622,2778.0,7500.6,4044768.0,1092087.36,0.0


In [12]:
# Final save before finally modelling

paris_df.to_csv('Paris neighborhoods final.csv', index=False)

#### Now our data table is ready for modelling

In [1]:
import numpy as np
import pandas as pd

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

filename = 'Paris neighborhoods final.csv'

paris_df = pd.read_csv(filename)
paris_df.head()

Unnamed: 0,District,Neighborhood,Population,Latitude,Longitude,Mean Income,Top 10%'s Income,Population Income,The Richest's Total Income,Competition
0,Paris 1er Arrondissement,Saint-Germain l'Auxerrois,1401,48.861562,2.333719,2994.0,8682.6,4194594.0,1216432.26,1.0
1,Paris 1er Arrondissement,Les Halles,8869,48.862335,2.344736,2994.0,8682.6,26553786.0,7700597.94,2.0
2,Paris 1er Arrondissement,Palais Royal,3239,48.865221,2.335364,2994.0,8682.6,9697566.0,2812294.14,0.0
3,Paris 1er Arrondissement,Place Vendome,2742,48.867447,2.329434,2994.0,8682.6,8209548.0,2380768.92,0.0
4,Paris 2e Arrondissement,Gaillon,1456,48.869662,2.333622,2778.0,7500.6,4044768.0,1092087.36,0.0


#### Clustering neighborhoods

In [2]:
from sklearn.cluster import KMeans

data = paris_df[["Population", "Population Income", "The Richest's Total Income", "Competition"]]

# Number of clusters
k = 5

kmeans = KMeans(n_clusters=k, random_state=0).fit(data)

# Check cluster labels generated for each row in the dataframe
kmeans.labels_

array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 2, 2, 2,
       3, 2, 1, 2, 1, 3, 2, 2, 2, 3, 1, 2, 2, 1, 1, 2, 1, 1, 3, 3, 3, 3,
       3, 0, 2, 1, 2, 0, 0, 1, 1, 1, 3, 0, 4, 0, 0, 0, 4, 0, 3, 3, 3, 0,
       0, 0, 0, 0, 1, 1, 3, 1, 3, 1, 1, 3, 3, 0], dtype=int32)

In [3]:
# Label the rows accordingly
paris_df['Cluster Label'] = kmeans.labels_
paris_df.head()

Unnamed: 0,District,Neighborhood,Population,Latitude,Longitude,Mean Income,Top 10%'s Income,Population Income,The Richest's Total Income,Competition,Cluster Label
0,Paris 1er Arrondissement,Saint-Germain l'Auxerrois,1401,48.861562,2.333719,2994.0,8682.6,4194594.0,1216432.26,1.0,2
1,Paris 1er Arrondissement,Les Halles,8869,48.862335,2.344736,2994.0,8682.6,26553786.0,7700597.94,2.0,2
2,Paris 1er Arrondissement,Palais Royal,3239,48.865221,2.335364,2994.0,8682.6,9697566.0,2812294.14,0.0,2
3,Paris 1er Arrondissement,Place Vendome,2742,48.867447,2.329434,2994.0,8682.6,8209548.0,2380768.92,0.0,2
4,Paris 2e Arrondissement,Gaillon,1456,48.869662,2.333622,2778.0,7500.6,4044768.0,1092087.36,0.0,2


In [4]:
# Visualize the clusters on the map to see if we can identify any geographical properties of each

import folium
import matplotlib.cm as cm
import matplotlib.colors as colors
import requests

# Get Paris's latitude and longitude
API_key = 'AIzaSyBm0pIjblcVL6P_9qpxO81mxMx2iodbOWA'
url = f'https://maps.googleapis.com/maps/api/geocode/json?address=Paris,+France&key={API_key}'
results = requests.get(url).json()
lat_paris = results['results'][0]['geometry']['location']['lat']
lng_paris = results['results'][0]['geometry']['location']['lng']

# Generate a map using Folium
map_paris = folium.Map(location=[lat_paris, lng_paris], zoom_start=12, tiles='Stamen Toner')

# Set color scheme for each district
labels = np.unique(kmeans.labels_)
num_labels = len(labels)

x = np.arange(num_labels)
ys = [i + x + (i*x)**2 for i in range(num_labels)]
colors_array = cm.rainbow(np.linspace(0, 1, num_labels))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# Add markers to the map
for lat, lon, cluster, neigh in zip(paris_df['Latitude'], paris_df['Longitude'], paris_df['Cluster Label'], paris_df['Neighborhood']):
    label = folium.Popup(f"{neigh} - Cluster {cluster}", parse_html=True)
    color = rainbow[cluster]
    folium.CircleMarker(
        [lat, lon],
        radius=6,
        popup=label,
        color=color,
        fill=True,
        fill_color=color,
        fill_opacity=0.7).add_to(map_paris)
       
map_paris

#### First observation

We can clearly see that cluster 2 consists mainly the central neighborhoods where income and wealthiness could be higher than the others, whereas cluster 0 tends to be near the outer ring of the city, which is in fact more populated but less wealthy. The other clusters are spreaded and need more analysis.

In [5]:
# Let's look at the mean of each criteria based on clusters

paris_df.groupby(by="Cluster Label").mean().round(2)

Unnamed: 0_level_0,Population,Latitude,Longitude,Mean Income,Top 10%'s Income,Population Income,The Richest's Total Income,Competition
Cluster Label,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
0,56817.29,48.86,2.33,2449.07,6181.52,134486600.0,33499480.97,1.57
1,23911.83,48.86,2.35,2484.5,6494.97,54982040.0,13891773.49,0.61
2,7555.3,48.86,2.34,3024.67,8634.9,21938660.0,6088923.29,0.77
3,36108.12,48.86,2.35,2641.69,7247.36,87026080.0,22933832.55,1.19
4,79118.5,48.84,2.28,3146.5,8660.55,245837100.0,66732813.7,2.5


Our first observation is correct! Here are my thoughts from the mean table above:
+ Cluster 0: high population, low income, not wealthy, moderate competition. _Not recommended_
+ Cluster 1: moderate population, low income, not wealthy, low competition. _Not recommended_
+ Cluster 2: low population, high income, very wealthy, low competition. ___Highly recommended___
+ Cluster 3: moderate population, moderate income, somewhat wealthy, moderate competition. _Not recommended_
+ Cluster 4: very high population, high income, very wealthy, high competition. ___Recommended___

Now we know that clusters 2 and 4 are the most wealthy, which have the highest income of the rich. Since their income is 8600 euros a month, they can surely afford the medical cost that we would charge. Since the 10% richest of population is my target clientele, I'm most interested in their total income altogether, i.e. this is my prefered criteria of picking the best neighborhood for my future clinic

In [6]:
# Let's grab only the rows of clusters 2 and 4

rows_mask = list(map(lambda x: x in [2,4], paris_df['Cluster Label'].to_list()))
paris_df[rows_mask].sort_values("The Richest's Total Income", ascending=False).head()

Unnamed: 0,District,Neighborhood,Population,Latitude,Longitude,Mean Income,Top 10%'s Income,Population Income,The Richest's Total Income,Competition,Cluster Label
60,Paris 16e Arrondissement,Auteuil,71581,48.849055,2.2663,3559.0,11032.9,254756779.0,78974601.49,3.0,4
56,Paris 15e Arrondissement,Saint-Lambert,86656,48.834458,2.30056,2734.0,6288.2,236917504.0,54491025.92,2.0,4
29,Paris 8e Arrondissement,Faubourg du Roule,9305,48.874666,2.303822,3814.0,12204.8,35489270.0,11356566.4,0.0,2
21,Paris 6e Arrondissement,Odeon,7939,48.84986,2.338692,3604.0,11532.8,28612156.0,9155889.92,1.0,2
16,Paris 5e Arrondissement,Saint-Victor,11673,48.84742,2.352845,3011.0,7828.6,35147403.0,9138324.78,2.0,2


#### Final verdict

Even though Auteuil seems to be a good candidate with the highest total income of the rich, the competition there is much harder, since within 500 meters from the neighborhood's center, there are already 3 clinics in business. On the other hand, I was first hesitating between the 2nd and 3rd best choices, since Saint-Lambert is more competitive than Faubourg du Roule (2 competitors versus zero), but in the end there are much more rich people and their total income is five time higher in Saint-Lambert than in Faubourg du Roule. The potential is clearly more evident.

## So, my next plan would be in: Saint-Lambert of Paris 15e Arrondissement

#### Thank you!