# Data Science Specialization - Capstone project

### IBM/Coursera Applied Data Science specialization course

## Table of contents

- [Introduction](#introduction)
- [Data](#data)
- [Methodology](#methodology)
- [Analysis](#analysis)
- [Results and discussion](#results-and-discussion)
- [Conclusions](#conclusions)

## Introduction

**The business problem** we are solving is finding an optimal location for a restaurant in *Moscow, Russia*. The stakeholders plan to open an Italian restaurant. The location should be close to the city center, in an area that is not crowded with other restaurants. Stakeholders are interested in locations no further than 6 kilometers from the city center.

There are many restaurants in the center of Moscow, so we need to identify areas with a moderate level of competition, that is, areas with a limited number of other restaurants nearby. Since we are looking for a place for an Italian restaurant, we should also avoid placing it near any existing Italian restaurant.

Stakeholders want a location as close as possible to the city center, and their interest drops with distance. This requirement will likely conflict with the number of restaurants in a location. In general, we should satisfy the first two requirements (low competition and no other Italian restaurants nearby) before optimizing the distance from the center.

Based on these requirements, we will define criteria for identifying the most promising locations.

## Data

Based on provided requirements we can define parameters that will affect our solution:

- total number of restaurants in an area
- distance to the nearest Italian restaurant located in an area
- distance of an area center to the city center

We also need to define what an area (location) is, so that we can compute these parameters for it. A single area is defined as a cell in a hexagonal grid that covers the city.

The area centers and the whole grid can be generated programmatically. Using geocoding libraries, we can find the coordinates of the Moscow city center. From these coordinates we will calculate the coordinates of all areas in the grid and their distances to the city center. Reverse geocoding will let us find an address for the center of any area in the grid.

Restaurant data, including location coordinates and type, can be obtained through the Foursquare API. This data can then be processed to get the number of restaurants in each area and the distance to the nearest Italian restaurant.

### Areas grid generation

We will generate a hexagonal grid that covers the city center and its surroundings within a predefined radius. Each generated point in the grid is the center of an area. The center points are placed so that areas of the chosen size cover the ground without gaps.

In [1]:
# The code was removed by Watson Studio for sharing.


Let's import all the libraries that we will use in further work.

In [2]:
import pandas as pd # work with data frame
import numpy as np # math operations and structures
from sklearn.cluster import KMeans # machine learning 

import requests # network requests
import math # math calculations
from collections import namedtuple #named tuple functionality

#!pip install geocoder
import geocoder # address to coordinates coding via open street maps

#!pip install shapely
import shapely.geometry # geoid geometry calculations

#!pip install pyproj
import pyproj # coordinates projection
from pyproj import Transformer, CRS # coordinates transformation

#!pip install folium
import folium # maps visualization
from folium import plugins
from folium.plugins import HeatMap

Next we define the geocoding and coordinate conversion functions that we will use later.

For geocoding we will use OpenStreetMap services. For grid generation and distance calculation we will convert latitude/longitude coordinates (WGS84 geographic coordinates) to UTM (Cartesian) coordinates in meters. We will also be able to convert calculated UTM coordinates back to latitude/longitude.

In [3]:
# address location
def findAddressLatLon(address, printResult=False):
    response = geocoder.osm(address)
    if printResult:
        print(response.osm)
    return (response.osm['y'], response.osm['x'])

def findLatLonAddress(lat, lon, printResult=False):
    response = geocoder.osm([lat, lon],  method='reverse')
    if printResult:
        print(response.osm)
    address_keys = ('addr:country', 'addr:city', 'addr:street', 'addr:housenumber')
    address = []
    for key in address_keys:
        if key in response.osm:
            address.append(response.osm[key])
    return ', '.join(address)

In [4]:
def convertLatLonToUtm(lat, lon):
    transformer = Transformer.from_crs(
        CRS(proj='latlong', ellps='WGS84'),
        CRS(proj='utm', zone=38, ellps='WGS84'))
    xy = transformer.transform(lon, lat)
    return xy[0], xy[1]

def convertUtmToLatLon(x, y):
    transformer = Transformer.from_crs(
        CRS(proj='utm', zone=38, ellps='WGS84'),
        CRS(proj='latlong', ellps='WGS84'))
    lonlat = transformer.transform(x, y)
    return lonlat[1], lonlat[0]

def calculateDistance(x1, y1, x2, y2):
    dx = x2 - x1
    dy = y2 - y1
    return math.sqrt(dx*dx + dy*dy)
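The conversion functions above hard-code the UTM zone. As a minimal sketch, assuming the standard 6-degree zone layout (and ignoring the special zones around Norway and Svalbard), the zone number can be derived from the longitude. The helper name `utmZoneForLongitude` is hypothetical and is not used elsewhere in this notebook.

```python
import math

def utmZoneForLongitude(lon):
    """Standard UTM zone number (1..60) for a longitude given in degrees."""
    return int(math.floor((lon + 180.0) / 6.0)) % 60 + 1

# Longitude of the Moscow city center as returned by the geocoder in this notebook
print(utmZoneForLongitude(37.6174943))  # prints 37
```

For the Moscow longitude this formula gives zone 37, while the functions above pass `zone=38`. The results in this notebook were computed with zone 38, which still yields planar coordinates in meters, though likely with somewhat larger projection distortion than the standard zone would give.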

Let's find the coordinates of the city center with the geocoding function and convert them to UTM. After that we will be ready to generate the grid of areas.

In [5]:
cityCenter = findAddressLatLon('Moscow, RU', True)
print('Moscow city center: latitude={}, longitude={}'.format(cityCenter[0], cityCenter[1]))

utmCenter = convertLatLonToUtm(cityCenter[0], cityCenter[1]) # City center in Cartesian coordinates
print('Moscow city center in UTM meters: X={}, Y={} '.format(utmCenter[0], utmCenter[1]))

{'x': 37.6174943, 'y': 55.7504461, 'addr:city': 'Москва', 'addr:state': 'Москва', 'addr:country': 'Россия'}
Moscow city center: latitude=55.7504461, longitude=37.6174943
Moscow city center in UTM meters: X=37079.000654250965, Y=6203013.361121159 


Here we define an algorithm that creates a hexagonal grid in Cartesian coordinates around the provided center, up to the given grid radius. Each cell of the grid covers a circular area defined by the provided cell radius.

In [6]:
def generateHexagonalGrid(center_x, center_y, grid_radius, cell_radius):
    k = math.sqrt(3) / 2 # Vertical offset for hexagonal grid cells
    x_step = math.sqrt(3) * cell_radius # equilateral triangle side from inscribed circle radius
    y_step = x_step * k
    columns_count = math.ceil(2*grid_radius/x_step) + 1
    rows_count = math.ceil(columns_count*math.sqrt(3)) + 1
    x_min = center_x-grid_radius
    y_min = center_y-grid_radius-(rows_count*k*cell_radius-grid_radius)
    grid=[]
    for i in range(0, rows_count):
        y = y_min + i*y_step
        x_offset = cell_radius if i%2==0 else 0
        for j in range(0, columns_count):
            x = x_min + j*x_step + x_offset
            distance = calculateDistance(center_x, center_y, x, y)
            if (distance > grid_radius + cell_radius + 1):
                continue
            grid.append((x, y, distance))
    return grid
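To make the spacing used by `generateHexagonalGrid` concrete, the sketch below recomputes the column and row steps for a cell radius of 800 meters, the value used for the Moscow grid in this notebook. It only repeats the arithmetic of the function above on one input.

```python
import math

cell_radius = 800  # meters, the cell radius used for the Moscow grid

x_step = math.sqrt(3) * cell_radius   # distance between cell centers within one row
y_step = x_step * math.sqrt(3) / 2    # distance between neighbouring rows

print(round(x_step, 2), round(y_step, 2))  # prints 1385.64 1200.0

# Both steps are smaller than the cell diameter, so neighbouring circles overlap.
print(x_step < 2 * cell_radius and y_step < 2 * cell_radius)  # prints True
```

These steps match the differences between consecutive `x` and `y` values in the generated grid data.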

Now let's create a grid of areas around the city center. Stakeholders are interested in locations no further than 6 kilometers from the city center, so the grid radius will be 6000 meters. The size of a single search area is defined by a radius of 800 meters, so each area is 1.6 kilometers in diameter.

After generating the grid, we must convert its Cartesian coordinates to latitude/longitude so that we can use them in further work. With the converted coordinates we can find the address of each area center in the grid. Addresses will help us present target locations in a human-readable form. We will also combine all calculated information into one dataframe for convenience and add an identification number to every area center.

In [7]:
# Try to load previously saved cells
gridRadius = 6000
cellRadius = 800
loaded = False
try:
    print('Loading neighbours grid data.')

    cells = pd.read_pickle('gridCells.pkl')
    print('Neighbours grid data loaded.')

    loaded = True
except Exception:
    # No usable cache (for example, the file does not exist yet)
    print('Neighbours grid failed.')

# If load failed then generating new grid cells
if not loaded:
    print('Generating neighbours grid data.')
    gridData = generateHexagonalGrid(utmCenter[0], utmCenter[1], gridRadius, cellRadius)

    cells = pd.DataFrame(gridData, columns=['x', 'y', 'distance'])
    cells[['lat', 'lon']]=cells.apply(lambda row: list(convertUtmToLatLon(row.x, row.y)), axis=1, result_type ='expand')
    cells['address']=cells.apply(lambda cell: findLatLonAddress(cell.lat, cell.lon), axis=1, result_type ='expand')
    cells['id']=np.arange(cells.shape[0])

    print('{} grid cells created.'.format(cells.shape[0]))
    cells.head(10)
    
    cells.to_pickle('gridCells.pkl') 
    print('Neighbours grid data saved.')

Loading neighbours grid data.
Neighbours grid failed.
Generating neighbours grid data.
86 grid cells created.
Neighbours grid data saved.


Let's visualize the generated grid of neighbouring areas. The cells partially overlap but cover the whole city center without gaps.

In [8]:
city_map=folium.Map(location=cityCenter, zoom_start=12)
folium.Marker(cityCenter, popup='City center').add_to(city_map)
folium.Circle(cityCenter, radius=gridRadius, color='white', fill=False).add_to(city_map)

for index, row in cells.iterrows():
    folium.Circle([row.lat, row.lon], radius=cellRadius, color='blue', fill=False).add_to(city_map)

city_map

Now let's look at the data itself.

In [9]:
cells.head(10)

| | x | y | distance | lat | lon | address | id |
|---|---|---|---|---|---|---|---|
| 0 | 34650.281946 | 6197050.0 | 6439.17962 | 55.694986 | 37.58925 | Россия, Москва, улица Карьер, 2 с1 | 0 |
| 1 | 36035.922592 | 6197050.0 | 6054.120206 | 55.69631 | 37.611111 | Россия, Москва, Загородное шоссе | 1 |
| 2 | 37421.563238 | 6197050.0 | 5973.41684 | 55.697631 | 37.632974 | Россия, Москва, улица Татлина | 2 |
| 3 | 38807.203885 | 6197050.0 | 6208.948866 | 55.698948 | 37.654838 | Россия, Москва, 2-й Кожуховский проезд, 29 к6 | 3 |
| 4 | 40192.844531 | 6197050.0 | 6727.583764 | 55.700261 | 37.676704 | Россия, Москва, 2-й Южнопортовый проезд | 4 |
| 5 | 32464.6413 | 6198250.0 | 6632.048336 | 55.70358 | 37.552728 | Россия, Москва, улица Косыгина, 17 к7 | 5 |
| 6 | 33850.281946 | 6198250.0 | 5754.683083 | 55.704911 | 37.574592 | Россия, Москва, Ленинский проспект, 36 | 6 |
| 7 | 35235.922592 | 6198250.0 | 5107.708844 | 55.706239 | 37.596457 | Россия, Москва, улица Орджоникидзе, 11 с11 | 7 |
| 8 | 36621.563238 | 6198250.0 | 4785.499125 | 55.707562 | 37.618325 | Россия, Москва, Малая Тульская улица, 25 | 8 |
| 9 | 38007.203885 | 6198250.0 | 4853.175674 | 55.708882 | 37.640194 | Россия, Москва, Даниловская набережная, 6 | 9 |


So the data looks normal. It contains both Cartesian and geographic coordinates, the distance from the city center and the address of every location. Every location also has its own identifier, which will help in further operations.

Now we can move on and retrieve information about restaurants located in the city center.

### Restaurants data

Now we need to get information about restaurants in the city center and assign each of them to one of the areas in the previously generated grid. We can use the Foursquare API to get information on restaurants and their locations.

We're interested in venues in the 'food' category. This category contains many different types of food venues, not only restaurants, so we should filter out everything that we can't count as a restaurant. The reason is that other types of food venues (like bakeries) are not direct competitors.

We will include in the restaurant list only venues with the words 'restaurant', 'diner', 'taverna' or 'steakhouse' in the category name. We will also find all venue categories that correspond to an Italian restaurant: from the 'food' category tree we will take the category named 'Italian' and all of its subcategories.

First we should define functions to work with the category data. We need to retrieve it, find a subcategory in it, convert the tree-like structure of subcategories into a flat list, and remove all unnecessary parts of the received objects.

In [10]:
# The code was removed by Watson Studio for sharing.

In [11]:
def requestCategories(showErrors=False):
    foursquare_categories_request='https://api.foursquare.com/v2/venues/categories?client_id={}&client_secret={}&v={}'
    url = foursquare_categories_request.format(foursquare_id, foursquare_secret, foursquare_api_version)
    try:
        categories = requests.get(url).json()['response']['categories']
    except Exception as e:
        if showErrors:
            print(e)
        categories=[]
    return categories

def findSubcategory(category, subcategoryName):
    search_queue = [category]
    while search_queue:
        item = search_queue.pop(0)
        if item['shortName'] == subcategoryName:
            return item
        else:
            search_queue.extend(item['categories'])
    return None
        

def flatCategories(category):
    categories = [category]
    for subcategory in category['categories']:
        subcategories = flatCategories(subcategory)
        categories.extend(subcategories)
    return categories

def parseCategories(categories):
    return [(cat['name'], cat['id']) for cat in categories]
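The tree search and flattening steps can be tried without calling Foursquare. The sketch below repeats the same breadth-first search and recursive flattening on a small hypothetical category tree. The tree contents and the snake_case function names are illustrative only; only the `name`, `shortName` and `categories` keys mirror what the functions above read.

```python
def find_subcategory(category, short_name):
    """Breadth-first search for a category with the given shortName."""
    queue = [category]
    while queue:
        item = queue.pop(0)
        if item['shortName'] == short_name:
            return item
        queue.extend(item['categories'])
    return None

def flat_categories(category):
    """Return the category followed by all of its descendants."""
    result = [category]
    for sub in category['categories']:
        result.extend(flat_categories(sub))
    return result

# Hypothetical, heavily reduced category tree
food = {'name': 'Food', 'shortName': 'Food', 'categories': [
    {'name': 'Bakery', 'shortName': 'Bakery', 'categories': []},
    {'name': 'Italian Restaurant', 'shortName': 'Italian', 'categories': [
        {'name': 'Roman Restaurant', 'shortName': 'Roman', 'categories': []},
    ]},
]}

italian = find_subcategory(food, 'Italian')
names = [cat['name'] for cat in flat_categories(italian)]
print(names)  # prints ['Italian Restaurant', 'Roman Restaurant']
```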

Then let's request the category list from Foursquare and find the food category in it.
Then we will search for all Italian restaurant categories. The final list is shown below.

This gives us the food category identifier and the list of all Italian food categories.

In [12]:
rawCategories=requestCategories()
for item in rawCategories:
    if item['name']=='Food':
        foodCategory = item

foodId = foodCategory['id']
print('Food category id: {}'.format(foodId))

italianCategories = parseCategories(flatCategories(findSubcategory(foodCategory, 'Italian')))
italianCategories

Food category id: 4d4b7105d754a06374d81259


[('Italian Restaurant', '4bf58dd8d48988d110941735'),
 ('Abruzzo Restaurant', '55a5a1ebe4b013909087cbb6'),
 ('Agriturismo', '55a5a1ebe4b013909087cb7c'),
 ('Aosta Restaurant', '55a5a1ebe4b013909087cba7'),
 ('Basilicata Restaurant', '55a5a1ebe4b013909087cba1'),
 ('Calabria Restaurant', '55a5a1ebe4b013909087cba4'),
 ('Campanian Restaurant', '55a5a1ebe4b013909087cb95'),
 ('Emilia Restaurant', '55a5a1ebe4b013909087cb89'),
 ('Friuli Restaurant', '55a5a1ebe4b013909087cb9b'),
 ('Ligurian Restaurant', '55a5a1ebe4b013909087cb98'),
 ('Lombard Restaurant', '55a5a1ebe4b013909087cbbf'),
 ('Malga', '55a5a1ebe4b013909087cb79'),
 ('Marche Restaurant', '55a5a1ebe4b013909087cbb0'),
 ('Molise Restaurant', '55a5a1ebe4b013909087cbb3'),
 ('Piadineria', '55a5a1ebe4b013909087cb74'),
 ('Piedmontese Restaurant', '55a5a1ebe4b013909087cbaa'),
 ('Puglia Restaurant', '55a5a1ebe4b013909087cb83'),
 ('Romagna Restaurant', '55a5a1ebe4b013909087cb8c'),
 ('Roman Restaurant', '55a5a1ebe4b013909087cb92'),
 ('Sardinian Restau

Now it's time to get the restaurant data from the Foursquare API.

We need a way to request a list of venues around the provided coordinates and to decide whether a received venue is a restaurant or not. We also need a way to check whether a venue category is one of the required Italian categories. Finally, we need to request the restaurant list for every area in the previously generated grid.

As we saw earlier, the grid areas overlap, so one venue can be returned for two areas. To deal with this, we will order the areas by distance from the city center and request venues in that order. We will also track all previously assigned venues and exclude them from later areas. In the end there will be no repeated venues in the list.

In [13]:
Venue = namedtuple('Venue',['id','name', 'lat', 'lon', 'address', 'categories', 'distance'])

def getVenueAddress(location):
    return ', '.join(location['formattedAddress'])

def combineCategoryIds(categories):
    return ','.join(c['id'] for c in categories)

def isRestaurant(categories):
    targetMarks = ['restaurant', 'diner', 'taverna', 'steakhouse']
    for category in categories:
        for mark in targetMarks:
            if mark in category['name'].lower():
                return True
    return False

def isRequeiredCategory(requiredCategories, categories):
    for  categoryName, categoryId in categories:
        if categoryId in requiredCategories:
            return True
    return False

def requestVenues(lat, lon, categories_list : list, radius=600, limit=100, showError=False):
    foursquare_venues_request='https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&categoryId={}&radius={}&limit={}'
    categories = ",".join(categories_list)
    url = foursquare_venues_request.format(
        foursquare_id, foursquare_secret, foursquare_api_version, #foursquare setup
        lat, lon, #search center coordinates
        categories, radius, limit) #venues request settings
    
    try:
        result = requests.get(url).json()['response']['groups'][0]['items']
        venues = [Venue(item['venue']['id'], 
                        item['venue']['name'],
                        item['venue']['location']['lat'], 
                        item['venue']['location']['lng'],
                        getVenueAddress(item['venue']['location']),
                        item['venue']['categories'],
                        item['venue']['location']['distance']) for item in result] 
    except Exception as e:
        if showError:
            print(e)
        venues=[]
    return venues

def findGridVenues(gridCells, categories, radius, requiredVenue):
    orderedCells = gridCells.sort_values(by='distance')
    knownVenues = set()
    gridVenues = {}
    for index, cell in orderedCells.iterrows():
        cellVenues = []
        gridVenues[cell.id] = cellVenues
        venues = requestVenues(cell.lat, cell.lon, categories, radius)
        for venue in venues:
            if not requiredVenue(venue) or venue.id in knownVenues:
                continue
            knownVenues.add(venue.id)
            cellVenues.append(venue)
    return gridVenues
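The deduplication rule used in `findGridVenues` can be illustrated on toy data, without any network requests. In this sketch the function name `assign_unique`, the cell ids and the venue ids are all hypothetical. Cells are assumed to be already ordered by distance from the city center, so a venue seen in two overlapping cells stays with the cell closer to the center.

```python
def assign_unique(cells_by_distance, venues_by_cell):
    """Assign every venue to the first cell (closest to the center) that returns it."""
    known = set()
    assigned = {}
    for cell_id in cells_by_distance:
        assigned[cell_id] = []
        for venue_id in venues_by_cell.get(cell_id, []):
            if venue_id in known:
                continue  # already assigned to a cell closer to the center
            known.add(venue_id)
            assigned[cell_id].append(venue_id)
    return assigned

# Venue 'v2' is returned for both overlapping cells
result = assign_unique(['a', 'b'], {'a': ['v1', 'v2'], 'b': ['v2', 'v3']})
print(result)  # prints {'a': ['v1', 'v2'], 'b': ['v3']}
```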

From the requested data we will create a dataframe to simplify further work and extend it with the necessary data. We will compute the Cartesian coordinates of the restaurants for further distance calculations and mark the Italian restaurants in the list.

In [14]:
# Try to load previously saved data
loaded = False
try:
    print('Loading restaurants data.')
    
    restaurants = pd.read_pickle('restaurants.pkl')
    print('Restaurant data loaded.')
    
    loaded = True
except:
    print('Loading failed.')
    pass

# If load failed then request data from Foursquare service
if not loaded:
    print('Requesting restaurants data.')
    
    
    gridVenues = findGridVenues(cells, [foodId], cellRadius, lambda venue: isRestaurant(venue.categories))
    venuesData = []
    for cellId in gridVenues:
        venuesData.extend((venue.id, venue.name, venue.lat, venue.lon, 
                           venue.address, combineCategoryIds(venue.categories), 
                           cellId, venue.distance) for venue in gridVenues[cellId])
    
    restaurants = pd.DataFrame(venuesData, columns=['id', 'name', 'lat', 'lon', 'address', 'categories', 'cellId', 'cellDistance'])
    restaurants[['x', 'y']]=restaurants.apply(lambda row: list(convertLatLonToUtm(row.lat, row.lon)), axis=1, result_type ='expand')
    
    _, requiredCategories = map(list, zip(*italianCategories))
    restaurants['isItalian'] = restaurants['categories'].str.contains("|".join(requiredCategories), case=False)
    print('{} restaurants received.'.format(restaurants.shape[0]))
    
    restaurants.to_pickle('restaurants.pkl')
    print('Restaurants data saved.')

Loading restaurants data.
Loading failed.
Requesting restaurants data.
1543 restaurants received.
Restaurants data saved.


Let's show the requested data.

In [15]:
restaurants.head(10)

| | id | name | lat | lon | address | categories | cellId | cellDistance | x | y | isItalian |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4d05f594dc45a0936f9cf1c6 | Корчма «Тарас Бульба» | 55.750644 | 37.610157 | Моховая ул., 8, стр. 1, 119019, Москва, Россия | 52e928d0bcbc57f1066b7e96 | 42 | 34 | 36622.236463 | 6203084.0 | False |
| 1 | 55e1cfcb498ee04c095f4c7a | Пян-се | 55.752643 | 37.609967 | ул. Воздвиженка, 4/7, Москва, Россия | 4bf58dd8d48988d108941735 | 42 | 257 | 36634.107861 | 6203308.0 | False |
| 2 | 4c409b03d7fad13a513c06da | Ширван | 55.751243 | 37.607539 | Староваганьковский пер., 19, стр. 7, 111222, М... | 5293a7d53cf9994f4e043a45 | 42 | 195 | 36465.540621 | 6203169.0 | False |
| 3 | 5b3361f060255e002c1f4270 | Il Pizzaiolo | 55.748705 | 37.609088 | Волхонка 6, 119019, Москва, Россия | 4bf58dd8d48988d110941735 | 42 | 194 | 36532.278887 | 6202877.0 | True |
| 4 | 53b94da2498eafad5864e500 | Dolmama | 55.753999 | 37.609699 | Романов пер., 2/6, стр. 13, Москва, Россия | 5f2c2b7db6d05514c7044837 | 42 | 409 | 36633.441482 | 6203460.0 | False |
| 5 | 5165121ae4b073f743689855 | Ugolek (Уголёк) | 55.756457 | 37.606357 | Большая Никитская ул., 12 (Газетный пер.), Мос... | 4bf58dd8d48988d1c4941735 | 42 | 723 | 36453.646308 | 6203755.0 | False |
| 6 | 5873b35702b60e3b699f05bb | Beluga (Белуга) | 55.756685 | 37.614032 | Моховая ул., 15/1, стр. 1, 125009, Москва, Россия | 5293a7563cf9994f4e043a44 | 42 | 746 | 36936.481084 | 6203729.0 | False |
| 7 | 541c4b83498e3fbc997f2c61 | Dr. Zhivago (Dr. Zhivago (Dr. Живаго)) | 55.756882 | 37.614406 | Моховая ул., 15/1, стр. 1, Москва, Россия | 5293a7563cf9994f4e043a44 | 42 | 774 | 36962.276151 | 6203748.0 | False |
| 8 | 55e96f70498e433f9cf915f2 | Severyane (Северяне) | 55.756438 | 37.606267 | Большая Никитская ул., 12, стр. 1, 125009, Мос... | 4bf58dd8d48988d1c4941735 | 42 | 722 | 36447.726748 | 6203754.0 | False |
| 9 | 4e75eac152b1c8e519a70340 | Чемодан | 55.748695 | 37.59969 | Гоголевский бул., 25, Москва, Россия | 5293a7563cf9994f4e043a44 | 42 | 683 | 35944.009197 | 6202939.0 | False |


So we have a list of all restaurants in the city center with Cartesian and geographic coordinates, addresses and categories. The data also shows whether a restaurant is Italian and the number of the grid area it belongs to.

This is all the information we need to move on, but first let's look at it and compute some simple metrics, such as the number of Italian restaurants, the average number of restaurants per area and the average number of Italian restaurants per area.

This information will help us agree with stakeholders on the definition of a crowded area and the preferable distance to the nearest Italian restaurant.

In [16]:
restaurantsTotal = restaurants.shape[0]
italianTotal = restaurants[restaurants['isItalian'] == True]['isItalian'].count()
print('Total number of restaurants:', restaurantsTotal)
print('Total number of Italian restaurants:', italianTotal)
print('Percentage of Italian restaurants: {:.2f}%'.format(italianTotal/restaurantsTotal * 100))
print('Average number of restaurants in neighborhood:', restaurants.groupby('cellId').count()['name'].mean())
print('Average number of Italian restaurants in neighborhood:', restaurants[restaurants['isItalian'] == True]
      .groupby('cellId').count()['name'].mean())

Total number of restaurants: 1543
Total number of Italian restaurants: 162
Percentage of Italian restaurants: 10.50%
Average number of restaurants in neighborhood: 18.36904761904762
Average number of Italian restaurants in neighborhood: 2.793103448275862
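Note that grouping by `cellId` only produces groups for areas that contain at least one matching restaurant, so the averages above are taken over non-empty areas only. The sketch below shows the difference on hypothetical toy data and then applies the same reasoning to the totals printed above. The grid size of 86 cells comes from the grid generation output.

```python
from collections import Counter

# Hypothetical toy data: the cell id of each restaurant; cell 2 has no restaurants.
restaurant_cells = [0, 0, 0, 1, 3, 3]
all_cells = [0, 1, 2, 3]

counts = Counter(restaurant_cells)
mean_non_empty = sum(counts.values()) / len(counts)      # what a group-by mean measures
mean_all_cells = sum(counts.values()) / len(all_cells)   # empty cells counted as zero
print(mean_non_empty, mean_all_cells)  # prints 2.0 1.5

# The same reasoning applied to the totals printed above
non_empty_cells = round(1543 / 18.36904761904762)
print(non_empty_cells, round(1543 / 86, 2))  # prints 84 17.94
```

So the reported average of about 18.4 restaurants corresponds to the 84 areas that have at least one restaurant. Averaged over all 86 areas, the figure would be slightly lower.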


In [17]:
italianRestaurants = restaurants[restaurants['isItalian'] == True]
italianRestaurants

| | id | name | lat | lon | address | categories | cellId | cellDistance | x | y | isItalian |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | 5b3361f060255e002c1f4270 | Il Pizzaiolo | 55.748705 | 37.609088 | Волхонка 6, 119019, Москва, Россия | 4bf58dd8d48988d110941735 | 42 | 194 | 36532.278887 | 6.202877e+06 | True |
| 11 | 5df29eaeee0c6b000862e171 | Lamberti | 55.757407 | 37.612473 | Тверская ул., 3, 125009, Москва, Россия | 4bf58dd8d48988d110941735 | 42 | 800 | 36847.562282 | 6.203819e+06 | True |
| 13 | 511d12a5e4b07e37595aebbc | Пат-а-шу | 55.754134 | 37.602903 | Калашный Переулок 4, Россия | 4bf58dd8d48988d110941735 | 42 | 623 | 36209.847097 | 6.203520e+06 | True |
| 25 | 57fcafef498ea7704e9fbfac | Trattoria Luigi | 55.749021 | 37.607541 | Деловой дом «Знаменка» (ул. Знаменка, 7, стр. ... | 4bf58dd8d48988d110941735 | 42 | 221 | 36439.187465 | 6.202922e+06 | True |
| 28 | 5d739e56462f640008870cbc | Il Letterato | 55.752193 | 37.611225 | 1 Vozdvizhenka Street (Mokhovaya Street), 1190... | 4bf58dd8d48988d110941735 | 42 | 216 | 36707.467947 | 6.203249e+06 | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1518 | 4d6e48b0792bb60cb25960be | Ceretto (Черетто) | 55.793408 | 37.544717 | Ленинградский просп., 37Б, 125167, Москва, Россия | 4bf58dd8d48988d110941735 | 74 | 710 | 33040.676134 | 6.208270e+06 | True |
| 1519 | 5afd8a4df00a70002cf0adbf | Rustic | 55.790811 | 37.531359 | Ходынский бул., 4, 125167, Москва, Россия | 4bf58dd8d48988d110941735 | 74 | 364 | 32174.590143 | 6.208072e+06 | True |
| 1524 | 5d29bccac835850024e6460b | Amarena Albero | 55.789362 | 37.538631 | Ходынский бульвар (2), 123007, Москва, Россия | 4bf58dd8d48988d110941735 | 74 | 146 | 32611.693419 | 6.207862e+06 | True |
| 1526 | 592c59690ff4f9137ea5a844 | Forte Bello | 55.790518 | 37.531739 | Ходынский бул., 4, 125167, Москва, Россия | 4bf58dd8d48988d110941735 | 74 | 327 | 32194.806830 | 6.208037e+06 | True |


Now let's display the restaurants on the city map and try to get some additional information from it. Italian restaurants are marked in red, all others in blue.

In [18]:
# display all restaurants in the map
cityMap=folium.Map(location=cityCenter, zoom_start=12)
folium.Marker(cityCenter, popup='City center').add_to(cityMap)

for index, restaurant in restaurants.iterrows():
    color = 'crimson' if restaurant.isItalian else 'blue'
    folium.CircleMarker([restaurant.lat, restaurant.lon], radius=2, color=color, fill=True, fill_color=color, fill_opacity=1).add_to(cityMap)

cityMap

It looks like most of the restaurants tend to align with the main city roads. This can be helpful when stakeholders choose the specific location of the restaurant.

Now we have all the required data. Next we'll analyse it to find the required area.

## Methodology

In this project we need to find areas of Moscow that have a low restaurant density and even fewer Italian restaurants. The search area is limited to a radius of 6 kilometers from the city center.

As a first step we have collected the data: the location and type (category) of every restaurant within 6 kilometers of the city center. We have also identified Italian restaurants (according to Foursquare categorization) and assigned every restaurant to an area.

The second step is the calculation and exploration of 'restaurant density' across the generated areas. For this purpose we will use heatmap visualization to identify a few promising areas close to the center with a low number of restaurants in general, and focus our attention on those areas.

In the third step we will create an area quality metric. This metric will help us rank areas and select the best position for a new restaurant. With such a metric in hand we will be able to cluster areas by their quality. This will help to offset the effects caused by the placement of grid cells and find the target location.

## Analysis

Let's analyse the gathered data. First, let's count the number of restaurants in every area.

We will define a new 'locations' dataframe to help us with data processing. For this we will copy the cells data and enrich it with data from restaurants. We will also need to fill empty values for areas without restaurants: such areas are possible, and their counts will be **NaN**. This can break further calculations, so let's fill them with zeros.

We will also calculate the distance from the center of every area to the nearest Italian restaurant.

In [19]:
locations = cells.copy()
locations['resCount'] = restaurants.groupby('cellId').count()['name'] # indexed by cellId, which matches the index of locations, so values align automatically
locations['resCount'] = locations['resCount'].fillna(0)
locations['italianDistance'] = locations.apply(
    lambda l: italianRestaurants.apply(lambda r: calculateDistance(l.x, l.y, r.x, r.y), axis=1).min(),
    axis=1)
locations.head(10)

| | x | y | distance | lat | lon | address | id | resCount | italianDistance |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 34650.281946 | 6197050.0 | 6439.17962 | 55.694986 | 37.58925 | Россия, Москва, улица Карьер, 2 с1 | 0 | 1.0 | 1094.168986 |
| 1 | 36035.922592 | 6197050.0 | 6054.120206 | 55.69631 | 37.611111 | Россия, Москва, Загородное шоссе | 1 | 1.0 | 921.820247 |
| 2 | 37421.563238 | 6197050.0 | 5973.41684 | 55.697631 | 37.632974 | Россия, Москва, улица Татлина | 2 | 6.0 | 477.14173 |
| 3 | 38807.203885 | 6197050.0 | 6208.948866 | 55.698948 | 37.654838 | Россия, Москва, 2-й Кожуховский проезд, 29 к6 | 3 | 12.0 | 777.545625 |
| 4 | 40192.844531 | 6197050.0 | 6727.583764 | 55.700261 | 37.676704 | Россия, Москва, 2-й Южнопортовый проезд | 4 | 3.0 | 858.337066 |
| 5 | 32464.6413 | 6198250.0 | 6632.048336 | 55.70358 | 37.552728 | Россия, Москва, улица Косыгина, 17 к7 | 5 | 0.0 | 1738.809975 |
| 6 | 33850.281946 | 6198250.0 | 5754.683083 | 55.704911 | 37.574592 | Россия, Москва, Ленинский проспект, 36 | 6 | 6.0 | 398.677813 |
| 7 | 35235.922592 | 6198250.0 | 5107.708844 | 55.706239 | 37.596457 | Россия, Москва, улица Орджоникидзе, 11 с11 | 7 | 13.0 | 514.073065 |
| 8 | 36621.563238 | 6198250.0 | 4785.499125 | 55.707562 | 37.618325 | Россия, Москва, Малая Тульская улица, 25 | 8 | 21.0 | 383.162625 |
| 9 | 38007.203885 | 6198250.0 | 4853.175674 | 55.708882 | 37.640194 | Россия, Москва, Даниловская набережная, 6 | 9 | 16.0 | 441.838558 |
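The `italianDistance` column is the minimum Euclidean distance, in UTM meters, from an area center to any Italian restaurant. A minimal standard-library sketch of that calculation on hypothetical points is shown below. The helper name `nearest_distance` is an illustration and is not part of the notebook.

```python
import math

def nearest_distance(point, others):
    """Smallest Euclidean distance from point to any (x, y) pair in others."""
    px, py = point
    return min(math.hypot(px - ox, py - oy) for ox, oy in others)

# Hypothetical area center and restaurant coordinates, in meters
print(nearest_distance((0, 0), [(3, 4), (6, 8)]))  # prints 5.0
```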


In [20]:
print('Average distance to closest Italian restaurant from each area center:', locations['italianDistance'].mean())

Average distance to closest Italian restaurant from each area center: 649.9577292230562


Let's show all restaurants on a heatmap and try to get some additional information.

In [21]:
cityMap = folium.Map(location=cityCenter, zoom_start=12)
folium.TileLayer('cartodbpositron').add_to(cityMap)
HeatMap(restaurants[['lat', 'lon']]).add_to(cityMap)

folium.Marker(cityCenter).add_to(cityMap)
folium.Circle(cityCenter, radius=0.33*gridRadius, color='white', fill=False).add_to(cityMap)
folium.Circle(cityCenter, radius=0.66*gridRadius, color='white', fill=False).add_to(cityMap)
folium.Circle(cityCenter, radius=gridRadius, color='white', fill=False).add_to(cityMap)

cityMap

The same for Italian restaurants.

In [22]:
cityMap = folium.Map(location=cityCenter, zoom_start=12)
folium.TileLayer('cartodbpositron').add_to(cityMap)
HeatMap(italianRestaurants[['lat', 'lon']]).add_to(cityMap)

folium.Marker(cityCenter).add_to(cityMap)
folium.Circle(cityCenter, radius=0.33*gridRadius, color='white', fill=False).add_to(cityMap)
folium.Circle(cityCenter, radius=0.66*gridRadius, color='white', fill=False).add_to(cityMap)
folium.Circle(cityCenter, radius=gridRadius, color='white', fill=False).add_to(cityMap)

cityMap

Based on what we can see, the center is mostly crowded with restaurants, and more or less free space can be found in the outer third of the city center area. There are a few gaps in the west and south that can be considered interesting regions.

The Italian restaurants map shows a small number of restaurants in the outer third of the radius. There is no free space in the north in the inner and middle thirds, but we can see free spaces in the west and south-east.

So both maps show a mostly empty outer third and a crowded northern half with some spots in the west.

That's interesting and gives us a direction for further exploration. But how did we decide that the northern area is crowded, and what is the definition of 'crowded'? We saw red parts on the map, but that's not enough. We should create a measurable way to check the quality of any area. With the help of this tool we will be able to find the best location.

We will then know why this location is the best, we will be able to describe how we selected it, and we will be able to correct and calibrate the tool according to the stakeholders' requirements.

So we need to define a quality metric for our areas.

### Quality metric

Let's define a quality metric that we will apply to every area. Thus we will be able to rank the areas and find the best one.

Our stakeholders are not interested in areas further than 6 kilometers from the center, so let the first component return zero for all areas further than this distance and one at the center of the city. We can define it as the difference between the grid radius and the area's distance to the center, divided by the grid radius.

For the second component, we return one for every area where the number of restaurants does not exceed the allowed maximum; for other areas we return the maximal allowed number of restaurants divided by the current number of restaurants. So this component decreases toward zero as the number of restaurants in the area increases.

We can treat the distance to the nearest Italian restaurant the same way. We return one for areas where this distance is at or above the required minimum, and for all others we divide the current distance by the required minimum.

So we have a set of 3 separate metrics, each returning a value from zero to one. We can combine them simply by multiplying and obtain a new metric that also returns a value from zero to one.
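
Written out, the three components and their product look as follows. The symbols are introduced here only for illustration: $d$ is the distance from the area center to the city center, $R$ is the grid radius, $d_I$ is the distance to the nearest Italian restaurant, $D_{\min}$ is the required minimal distance, $n$ is the number of restaurants in the area, and $N_{\max}$ is the maximal allowed number.

```latex
\[
q_{\text{center}} =
\begin{cases} 0, & d \ge R \\ \dfrac{R - d}{R}, & d < R \end{cases}
\qquad
q_{\text{italian}} =
\begin{cases} 1, & d_I \ge D_{\min} \\ \dfrac{d_I}{D_{\min}}, & d_I < D_{\min} \end{cases}
\qquad
q_{\text{count}} =
\begin{cases} 1, & n \le N_{\max} \\ \dfrac{N_{\max}}{n}, & n > N_{\max} \end{cases}
\]
\[
Q = q_{\text{center}} \cdot q_{\text{italian}} \cdot q_{\text{count}}, \qquad 0 \le Q \le 1
\]
```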

In [23]:
def createQualityDistance(distanceToCenter, distanceToCompetitor, totalCount):
    return lambda testCenter, testNearest, testCount: calculateQualityDistance(
        distanceToCenter, distanceToCompetitor, totalCount,
        testCenter, testNearest, testCount)

def calculateQualityDistance(centerDistance, competitorDistance, totalCount, testCenter, testNearest, testCount):
    distanceModifier = 0.0 if testCenter >= centerDistance else (centerDistance - testCenter)/centerDistance
    competitorModifier = 1.0 if testNearest >= competitorDistance else testNearest/competitorDistance
    countModifier = 1.0 if testCount <= totalCount else totalCount/testCount
    return distanceModifier * competitorModifier * countModifier



By applying this metric to all locations we obtain a value that allows us to rank them and find the best location.

But first we need to define the maximal allowed number of restaurants in an area and the minimal distance to the nearest Italian restaurant.
Since the average number of restaurants in an area was calculated as 18.5, the stakeholders set the maximal number of restaurants at 15.
And since the average distance to the nearest Italian restaurant was about 650 meters, the required minimal distance was set at 400 meters.

We will calculate the previously defined metric for every location, and then we will normalize the resulting quality values for simplicity of visualization.

In [24]:
qualityCalculator = createQualityDistance(gridRadius, 400, 15)

locations['quality'] = locations.apply(lambda l: qualityCalculator(l.distance, l.italianDistance, l.resCount),axis=1)
locations['quality'] = (locations['quality']-locations['quality'].min())/(locations['quality'].max()-locations['quality'].min())
locations.head(10)

Unnamed: 0,x,y,distance,lat,lon,address,id,resCount,italianDistance,quality
0,34650.281946,6197050.0,6439.17962,55.694986,37.58925,"Россия, Москва, улица Карьер, 2 с1",0,1.0,1094.168986,0.0
1,36035.922592,6197050.0,6054.120206,55.69631,37.611111,"Россия, Москва, Загородное шоссе",1,1.0,921.820247,0.0
2,37421.563238,6197050.0,5973.41684,55.697631,37.632974,"Россия, Москва, улица Татлина",2,6.0,477.14173,0.010985
3,38807.203885,6197050.0,6208.948866,55.698948,37.654838,"Россия, Москва, 2-й Кожуховский проезд, 29 к6",3,12.0,777.545625,0.0
4,40192.844531,6197050.0,6727.583764,55.700261,37.676704,"Россия, Москва, 2-й Южнопортовый проезд",4,3.0,858.337066,0.0
5,32464.6413,6198250.0,6632.048336,55.70358,37.552728,"Россия, Москва, улица Косыгина, 17 к7",5,0.0,1738.809975,0.0
6,33850.281946,6198250.0,5754.683083,55.704911,37.574592,"Россия, Москва, Ленинский проспект, 36",6,6.0,398.677813,0.101036
7,35235.922592,6198250.0,5107.708844,55.706239,37.596457,"Россия, Москва, улица Орджоникидзе, 11 с11",7,13.0,514.073065,0.368717
8,36621.563238,6198250.0,4785.499125,55.707562,37.618325,"Россия, Москва, Малая Тульская улица, 25",8,21.0,383.162625,0.343384
9,38007.203885,6198250.0,4853.175674,55.708882,37.640194,"Россия, Москва, Даниловская набережная, 6",9,16.0,441.838558,0.444278
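
As a sanity check, the quality of area 8 in the table above can be reproduced by hand. This is a minimal sketch that assumes `gridRadius` is 6000 meters (the 6 km limit from the introduction; the value itself is not shown in this part of the notebook), that the lowest raw quality in the grid is 0, and that the highest raw quality belongs to area 16 from the top-ten table further below. The helper name `raw_quality` is illustrative and simply mirrors `calculateQualityDistance`.

```python
GRID_RADIUS = 6000.0   # assumption: the 6 km limit, in meters
MIN_ITALIAN = 400.0    # required distance to the nearest Italian restaurant
MAX_COUNT = 15.0       # maximal allowed number of restaurants in an area


def raw_quality(distance, italian_distance, res_count):
    """Product of the three modifiers, before min-max normalization."""
    center = 0.0 if distance >= GRID_RADIUS else (GRID_RADIUS - distance) / GRID_RADIUS
    italian = 1.0 if italian_distance >= MIN_ITALIAN else italian_distance / MIN_ITALIAN
    count = 1.0 if res_count <= MAX_COUNT else MAX_COUNT / res_count
    return center * italian * count


# Area 8: too many restaurants (21 > 15) and an Italian restaurant closer than 400 m.
area8 = raw_quality(4785.499125, 383.162625, 21.0)

# Assumed extremes of the raw metric over the whole grid (see the text above).
q_min = 0.0
q_max = raw_quality(3580.013308, 505.698918, 14.0)  # area 16

# Same min-max normalization as applied to the 'quality' column.
area8_normalized = (area8 - q_min) / (q_max - q_min)
print(round(area8_normalized, 6))
```

Under these assumptions the printed value should agree with the `quality` column for area 8 (0.343384) to about four decimal places. Areas further than 6000 meters from the center, such as area 0, get a raw quality of exactly zero.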


Let's visualize the areas according to their quality. The higher the quality, the brighter the circle bordering the area. This will help us visually identify the most promising areas and find patterns in how quality is distributed.

In [25]:
cityMap = folium.Map(location=cityCenter, zoom_start=12)
folium.TileLayer('cartodbpositron').add_to(cityMap)
for index, location in locations.iterrows():
    folium.Circle([location.lat, location.lon], radius=cellRadius, color='blue', opacity=location.quality, fill=False).add_to(cityMap)

HeatMap(italianRestaurants[['lat', 'lon']]).add_to(cityMap)

folium.Marker(cityCenter).add_to(cityMap)
folium.Circle(cityCenter, radius=gridRadius, color='white', fill=False).add_to(cityMap)

cityMap

In [26]:
cityMap = folium.Map(location=cityCenter, zoom_start=12)
folium.TileLayer('cartodbpositron').add_to(cityMap)
for index, location in locations.iterrows():
    folium.Circle([location.lat, location.lon], radius=cellRadius, color='blue', opacity=location.quality, fill=False).add_to(cityMap)

HeatMap(restaurants[['lat', 'lon']]).add_to(cityMap)

folium.Marker(cityCenter).add_to(cityMap)
folium.Circle(cityCenter, radius=gridRadius, color='white', fill=False).add_to(cityMap)

cityMap

We are able to identify some interesting locations; most of them are in the east, but some can be found in the west and even in the north.

Let's select the ten best locations and visualize them on a separate map. The color contrast will be increased so that we can easily distinguish the quality order.

In [27]:
locations.sort_values(by='quality', ascending=False).head(10)

Unnamed: 0,x,y,distance,lat,lon,address,id,resCount,italianDistance,quality
16,37421.563238,6199450.0,3580.013308,55.719018,37.628926,"Россия, Москва, 3-й Павловский переулок, 2 с1",16,14.0,505.698918,1.0
70,37421.563238,6206650.0,3652.513505,55.783176,37.616752,"Россия, Москва, Большая Екатерининская улица, ...",70,13.0,594.933627,0.970041
45,40778.485177,6203050.0,3699.663728,55.754287,37.675889,"Россия, Москва, Строгановский проезд",45,5.0,492.803866,0.950557
54,40192.844531,6204250.0,3350.334748,55.764427,37.664616,"Россия, Москва, Малый Демидовский переулок, 3",54,18.0,421.446092,0.912424
30,33264.6413,6201850.0,3987.890417,55.736424,37.559221,"Россия, Москва, Бережковская набережная, 16",30,12.0,524.468106,0.831455
59,33850.281946,6205450.0,4044.840788,55.769062,37.562335,"Россия, Москва, Ходынская улица, 3",59,8.0,415.079106,0.807921
43,38007.203885,6203050.0,928.917222,55.751656,37.632098,"Россия, Москва, улица Варварка, 6 с4",43,39.0,402.621934,0.805962
60,35235.922592,6205450.0,3055.003969,55.770393,37.584235,"Россия, Москва, Тишинская площадь, 6",60,23.0,635.427201,0.793661
26,39392.844531,6200650.0,3307.629501,55.731586,37.658031,"Россия, Москва, Крестьянская площадь",26,17.0,313.143944,0.768508
32,36035.922592,6201850.0,1562.672244,55.739081,37.602988,"Россия, Москва, Бутиковский переулок, 3",32,37.0,547.396688,0.743358


In [28]:
cityMap = folium.Map(location=cityCenter, zoom_start=12)
folium.TileLayer('cartodbpositron').add_to(cityMap)

targetLocations = locations.sort_values(by='quality', ascending=False).head(10)
for index, location in targetLocations.iterrows():
    folium.Circle([location.lat, location.lon], radius=cellRadius, color='blue', opacity=location.quality**3, fill=False).add_to(cityMap)

HeatMap(restaurants[['lat', 'lon']]).add_to(cityMap)

folium.Marker(cityCenter).add_to(cityMap)
folium.Circle(cityCenter, radius=gridRadius, color='white', fill=False).add_to(cityMap)

cityMap
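
To see why cubing the quality increases the contrast, the snippet below applies the same `quality**3` transform to the highest and lowest values from the top-ten table above. It is only an illustration of the opacity mapping, using two values copied from that table.

```python
# Highest and lowest quality among the ten best locations (from the table above).
best, tenth = 1.0, 0.743358

# Opacity gap when quality is used directly vs. when it is cubed.
linear_gap = best - tenth
cubed_gap = best**3 - tenth**3

print(f"linear gap: {linear_gap:.3f}, cubed gap: {cubed_gap:.3f}")
```

The best location keeps full opacity, while the tenth one fades much more strongly, so the gap between them more than doubles.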

Here we can see a promising area even in the center of the city. It is not the best according to its color, but it is surely promising. The best location is in the south, and it should be the first place we visit. We can also see a group of good locations in the east. In the west and north we can find separate locations that are suitable for our purposes, but all of them are in the middle third of the allowed radius.

Now we will try to investigate the internal structure of the quality grid. We will cluster the locations according to their position and quality and try to get some additional information from that.

In [29]:
clustersNumber = 20

values = locations[['lat', 'lon', 'quality']].values  # clustering features: position and quality
clustering = KMeans(n_clusters=clustersNumber, random_state=0).fit(values)
cluster_centers = [(cluster[0], cluster[1], cluster[2]) for cluster in clustering.cluster_centers_]

cityMap = folium.Map(location=cityCenter, zoom_start=12)
folium.TileLayer('cartodbpositron').add_to(cityMap)
HeatMap(restaurants[['lat', 'lon']]).add_to(cityMap)

folium.Circle(cityCenter, radius=gridRadius, color='white', fill=True, fill_opacity=0.4).add_to(cityMap)
folium.Marker(cityCenter).add_to(cityMap)

for lat, lon, quality in cluster_centers:
    folium.Circle([lat, lon], radius=500, color='green', fill=True, fill_opacity=quality).add_to(cityMap)
    
for index, location in targetLocations.iterrows():
    folium.Circle([location.lat, location.lon], radius=cellRadius, color='blue', opacity=location.quality**3, fill=False).add_to(cityMap)
    
cityMap

Let's look at the cluster data more closely.

In [30]:
clusters = pd.DataFrame(cluster_centers, columns=['lat', 'lon', 'quality'])
clusters.sort_values(by='quality', ascending=False)

Unnamed: 0,lat,lon,quality
5,55.755227,37.646546,0.958256
17,55.758626,37.568597,0.811013
2,55.745405,37.648041,0.770484
19,55.740271,37.586919,0.712236
10,55.731138,37.650636,0.651728
7,55.75368,37.66579,0.568543
0,55.759955,37.590491,0.489676
15,55.777623,37.632293,0.46352
14,55.72871,37.67022,0.428157
6,55.713451,37.581682,0.342852


We can see that the cluster with the best quality has nearly the maximal possible value: its score is about 0.96. So this location is really promising, and it should be easy to fit a new Italian restaurant nearby. It also combines the group of good locations with the good location nearest to the center. So we can define the east direction as the main one.

Also, according to the cluster data, a good location can be found to the south-west of the center, whereas the north in general has poor quality. It will be hard to find a good place there. The same could be said about the south-east.

Several interesting locations belong to the same cluster, which supports our choice. Now we can determine addresses for the best locations of the future restaurant.

The address of the best-quality cluster center:

In [31]:
bestClusterCoords = clusters.sort_values(by='quality', ascending=False).head(1)[['lat', 'lon']].values
findLatLonAddress(bestClusterCoords[0][0], bestClusterCoords[0][1])

'Россия, Москва, Большой Трёхсвятительский переулок, 3'

The best chances of finding a suitable location will be around there.

The best location according to our research:

In [32]:
locations.sort_values(by='quality', ascending=False).head(1)

Unnamed: 0,x,y,distance,lat,lon,address,id,resCount,italianDistance,quality
16,37421.563238,6199450.0,3580.013308,55.719018,37.628926,"Россия, Москва, 3-й Павловский переулок, 2 с1",16,14.0,505.698918,1.0


The nearest of the good locations is:

In [33]:
locations[locations['quality'] > 0.8].sort_values(by='distance').head(1)

Unnamed: 0,x,y,distance,lat,lon,address,id,resCount,italianDistance,quality
43,38007.203885,6203050.0,928.917222,55.751656,37.632098,"Россия, Москва, улица Варварка, 6 с4",43,39.0,402.621934,0.805962


So the addresses of the most interesting locations are:

In [34]:
print(*locations.sort_values(by='quality', ascending=False).head(1)['address'].values)
print(*locations[locations['quality'] > 0.8].sort_values(by='distance').head(1)['address'].values)

Россия, Москва, 3-й Павловский переулок, 2 с1
Россия, Москва, улица Варварка, 6 с4


In [62]:
cityMap = folium.Map(location=cityCenter, zoom_start=12)
folium.TileLayer('cartodbpositron').add_to(cityMap)

folium.Circle(cityCenter, radius=gridRadius, color='white', fill=True, fill_opacity=0.2).add_to(cityMap)
folium.Marker(cityCenter).add_to(cityMap)

bestLocation = locations.sort_values(by='quality', ascending=False).head(1)
closestLocation = locations[locations['quality'] > 0.8].sort_values(by='distance').head(1)
bestCluster = clusters.sort_values(by='quality', ascending=False).head(1)

# Each selection is a one-row DataFrame, so take the scalar coordinates explicitly.
folium.Circle([bestLocation['lat'].values[0], bestLocation['lon'].values[0]], radius=cellRadius, color='blue', opacity=bestLocation['quality'].values[0]**3, fill=False).add_to(cityMap)
folium.Circle([closestLocation['lat'].values[0], closestLocation['lon'].values[0]], radius=cellRadius, color='blue', opacity=closestLocation['quality'].values[0]**3, fill=False).add_to(cityMap)
folium.Circle([bestCluster['lat'].values[0], bestCluster['lon'].values[0]], radius=500, color='green', fill=True, fill_opacity=bestCluster['quality'].values[0]).add_to(cityMap)

cityMap

So we can provide our stakeholders with these two most interesting locations along with the list of the best ones. This will allow them to start a 'street' search for a specific building for the new restaurant.

## Results and Discussion

Our analysis shows that we are able to find a good place for an Italian restaurant even in a city as dense as Moscow: there are pockets of low restaurant density fairly close to the city center. The highest concentration of restaurants was detected north and west of the city center, and the best positions should be in the eastern part of the city. Some additional locations were identified as interesting even in the crowded northern part of the city, but we concentrated on the most promising ones.

We supported our choice of the best places with the help of clustering, which shows that both locations selected by us are connected with the cluster with the best quality.

The list of the 10 locations with the best quality is also available to stakeholders. This, of course, does not imply that those zones are actually optimal locations for a new restaurant. The purpose of this analysis was only to provide information on areas close to the Moscow center but not crowded with existing restaurants (particularly Italian ones). It is possible that there is a good reason for the small number of restaurants in any of those areas, a reason which would make them unsuitable for a new restaurant. It is also possible that additional factors and a finer location grid would provide more specific results. Nonetheless, the provided locations can be considered a starting point for a more detailed analysis, which could result in more detailed specifications for the placement of a new restaurant.

## Conclusions

The purpose of this project was to identify locations in Moscow close to the center with a low number of restaurants (especially Italian ones) in order to aid stakeholders in the search for an optimal location for a new Italian restaurant. By processing Foursquare data we were able to identify areas of interest. We found some promising areas which satisfy all requirements. Clustering supports our selection of locations, and we were able to provide the addresses of the most interesting locations to the stakeholders. We also found additional good locations that were indistinguishable to the naked eye. These were found with the help of a simple and configurable metric, which can be discussed and improved if necessary.

The final decision on the optimal restaurant location will be made by stakeholders based on specific characteristics of neighborhoods and locations in every recommended area.