Capstone Project - Opening a New Fitness Club in Moscow, Russia (Week 5)
======================

Applied Data Science Capstone by IBM/Coursera
------

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data collection](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)

Introduction <a name="introduction"></a>
-------------

This project focuses on the selection of the optimal location for a fitness club in Moscow, Russia.

A review in [Forbes](https://www.forbes.ru/biznes/417589-razrushitelnoe-cunami-2020-god-stal-hudshim-za-vsyu-istoriyu-rossiyskoy-industrii) reports that about 60% of market participants went out of business during the pandemic.

Restrictions on visits to offices, shops, and entertainment centers, which include fitness clubs, are now gradually being lifted. New players are expected to enter the market, and former customers are expected to return.

### Business problem

The task is to assess how promising individual districts of Moscow are for new fitness clubs. Using data science and clustering methods, we want to determine where opening a new fitness club would be most profitable.

The report should provide answers to the following questions:

- The most efficient location for a fitness club aimed at office workers.

- The most efficient location for a family-oriented fitness club.


### Target audience

The target audience of this project is investors and large fitness center chains that want to use their funds effectively during the economic recovery.


Data  <a name="data"></a>
----------------

**Data to solve the problem**

Data expected to be used in preparing this project:

1. Distribution of population density in Moscow, Russia
2. Distribution of offices and colleges in Moscow, Russia
3. Distribution of gyms in Moscow, Russia

**Data sources and methods to analyze**

All of these data can be derived from venue data.

First, we get the geographical boundaries of Moscow from [OpenStreetMap](https://www.openstreetmap.org/) and use the Monte Carlo method to place random points inside those boundaries.

OpenStreetMap is built by a community of mappers who contribute and maintain data about roads, trails, cafés, railway stations, and much more, all over the world. 
It emphasizes local knowledge: contributors use aerial imagery, GPS devices, and low-tech field maps to verify that OSM is accurate and up to date. 
The OpenStreetMap community is diverse and growing, and its contributors include enthusiast mappers, GIS professionals, and the engineers who run the OSM servers.


We take this approach because of the limitations on gathering venue data: venues can only be requested in the vicinity of a selected point.
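The Monte Carlo step described above can be sketched as rejection sampling: draw uniform points in the bounding box and keep those that fall inside the boundary. This is a minimal sketch, not the notebook's own generator (which is not shown). The triangle `toy_ring`, the helper names, and the seed are assumptions for illustration, and a plain ray-casting test stands in for the `shapely` containment check used later in the notebook.

```python
import random


def point_in_polygon(lng, lat, ring):
    """Ray-casting test: True if (lng, lat) lies inside the closed ring."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > lat) != (y2 > lat):
            # Longitude at which the edge crosses that horizontal line
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lng < x_cross:
                inside = not inside
    return inside


def sample_points(ring, n_points, seed=0):
    """Draw uniform points in the bounding box and keep those inside the ring."""
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    lngs = [p[0] for p in ring]
    lats = [p[1] for p in ring]
    points = []
    while len(points) < n_points:
        lng = rng.uniform(min(lngs), max(lngs))
        lat = rng.uniform(min(lats), max(lats))
        if point_in_polygon(lng, lat, ring):
            points.append((lat, lng))
    return points


# Hypothetical triangle in (lng, lat) order, NOT the real Moscow boundary
toy_ring = [(37.3, 55.5), (37.9, 55.5), (37.6, 55.95)]
points = sample_points(toy_ring, 700)
print(len(points))
```

Rejection sampling keeps the points uniformly distributed over the area, at the cost of discarding draws that land outside the boundary.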

After that, we will use the Foursquare API to get venue data for those points. Foursquare has one of the largest databases of places, with more than 105 million entries. The Foursquare API provides many categories of venue data.

We will use these categories to estimate the data needed to solve the business problem.
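The notebook does not show the request itself, so the following is only a sketch of how a query for one sampled point might be built. It assumes the legacy Foursquare v2 `venues/explore` endpoint; the helper name `build_explore_url` and the default `radius` are hypothetical, while the version string and limit mirror the values set in the credentials cell. No network call is made here.

```python
from urllib.parse import urlencode


def build_explore_url(lat, lng, client_id, client_secret, version,
                      radius=500, limit=30):
    """Build a venues/explore URL for one point (radius in meters is an assumed value)."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': f'{lat},{lng}',  # "latitude,longitude" of the sampled point
        'radius': radius,
        'limit': limit,
    }
    return 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params)


# Placeholder credentials; real values would come from a local config module
url = build_explore_url(55.7504461, 37.6174943, 'ID', 'SECRET', '20180604')
print(url)
```

The URL could then be fetched with `requests.get(url).json()`, and the response cached to disk so that each point is queried only once.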

This project draws on many data science skills: web scraping (Wikipedia), working with an API (Foursquare), data cleaning, data wrangling, machine learning (DBSCAN clustering), and map visualization (Folium).

In [6]:
import pandas as pd
import numpy as np
import bs4
from bs4 import BeautifulSoup  # this module helps with web scraping
import requests  # this module helps us download a web page

In [2]:
# Use the Foursquare API
# Define Foursquare credentials and version (placeholders only)
CLIENT_ID = 'your Foursquare ID' # your Foursquare ID
CLIENT_SECRET = 'your Foursquare Secret' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

# Load the real credentials from a local config module; these override the placeholders above
from config import CLIENT_ID, CLIENT_SECRET, ACCESS_TOKEN
VERSION = '20180604'
LIMIT = 30

Your credentials:
CLIENT_ID: your Foursquare ID
CLIENT_SECRET:your Foursquare Secret


Download the Foursquare category data and save it to CSV to avoid repeated downloads.

__Run this cell only the first time.__ It stores the raw data to prevent extra network calls.

Restore the Foursquare category data from CSV:

In [7]:
df = pd.read_csv('FourSquare_Cats.csv')
df.set_index(df.columns[0], inplace=True)

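# Map each group name (first value in a row) to all category names in that row
cat_group = {}
for x in df[df.columns].values:
    cat_group[x[0]] = [y for y in x if isinstance(y, str)]


def group_category(detailed):
    """Return the group that contains the detailed category, or 'NONE'."""
    for key in cat_group:
        if detailed in cat_group[key]:
            return key

    return 'NONE'

# Check that the 'Bakery' category belongs to the 'Food' category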
print(group_category('Bakery'))

Food
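`group_category` scans every group on each call. When it is applied to thousands of venues, a reverse index gives the same answer with a single dictionary lookup. The sketch below uses a toy mapping: only the `Bakery` to `Food` relation is confirmed by the check above, the other entries are illustrative, and the names `toy_cat_group`, `parent_of`, and `group_category_indexed` are hypothetical.

```python
# Toy stand-in for cat_group; only 'Bakery' -> 'Food' is confirmed by the notebook
toy_cat_group = {
    'Food': ['Food', 'Bakery', 'Fast Food Restaurant'],
    'College & University': ['College & University', 'College Library'],
}

# Invert the mapping once: detailed category -> group name
parent_of = {
    detailed: group
    for group, members in toy_cat_group.items()
    for detailed in members
}


def group_category_indexed(detailed):
    """Return the group for a detailed category, or 'NONE' if it is unknown."""
    return parent_of.get(detailed, 'NONE')


print(group_category_indexed('Bakery'))       # Food
print(group_category_indexed('Hobby Shop'))   # NONE (not in the toy mapping)
```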


Get the polygon of Moscow from its OpenStreetMap relation, using a utility that builds a polygon from a relation id.

__Run this cell only the first time.__ It stores the raw data to prevent extra network calls.

Restore the polygon and define a containment check:

In [4]:
from shapely.geometry import Point
from shapely.geometry.polygon import Polygon

polygons = []
min_lat= 90
max_lat= -90
min_lng= 180
max_lng= -180
_bnds = []

state = None  # parser state: None (outside the polygon), 'HEAD' (expecting a ring id), 'BODY' (reading vertices)
with open('Moscow.poly', 'r') as reader:
    for line in reader:
        line = line.rstrip('\n')
        if state is None and line == 'polygon':
            state = 'HEAD'
        elif state == 'HEAD':
            if line == 'END':
                state = None
            else:
                ind = int(line)  # ring identifier (parsed but not used further)
                state = 'BODY'
                _bnds = []
        elif state == 'BODY':
            if line == 'END':
                # the ring is complete: store it and wait for the next ring id
                state = 'HEAD'
                polygons.append(Polygon(_bnds))
            else:
                # vertex lines are tab-separated with a leading tab: longitude, then latitude
                pts = [float(y) for y in line.split('\t')[1:]]
                # keep track of the bounding box of all vertices
                min_lat = min(min_lat, pts[1])
                max_lat = max(max_lat, pts[1])
                min_lng = min(min_lng, pts[0])
                max_lng = max(max_lng, pts[0])
                _bnds.append(tuple(pts))

                
def check_point_in_moscow(lat, lng):
    point = Point(lng, lat)
    for p in polygons:
        if p.contains(point):
            return True
    return False

print(f"min lat: {min_lat} max lat: {max_lat} min_lng: {min_lng} max_lng: {max_lng}")

min lat: 55.4913076 max lat: 55.9577717 min_lng: 37.290502 max_lng: 37.9674277


In [5]:
# check that Moscow contains Moscow :-)
import folium
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

address = 'Moscow'

geolocator = Nominatim(user_agent='foursquare_agent')
location = geolocator.geocode(address)

print(location.latitude, location.longitude)
print(check_point_in_moscow(location.latitude, location.longitude))

55.7504461 37.6174943
True


Create a random set of 700 points inside the boundaries of Moscow.

__Run this cell only the first time.__ It stores the raw data to prevent extra calls.

Restore the stored points so that the results are repeatable:

In [6]:
grid_df = pd.read_csv('MonteCarlo_Moscow.csv')
grid_df.set_index(grid_df.columns[0], inplace=True)

In [7]:
import folium
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

address = 'Moscow'

geolocator = Nominatim(user_agent='foursquare_agent')
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

map_moscow = folium.Map(location=[latitude, longitude], zoom_start=9)

# add markers to map
for lat, lng in grid_df[['Latitude', 'Longitude']].values:
    folium.CircleMarker(
        [lat, lng],
        radius=1,
        color='blue',
        fill=True,
        fill_color='blue',
        fill_opacity=0.7).add_to(map_moscow)  
    
map_moscow

Gather data from Foursquare about the selected random points.

**Run this cell only the first time.** It stores the raw data to prevent extra calls.

Restore the Foursquare venues from the saved CSV file:

In [8]:
explore_df = pd.read_csv('Explore.csv')
explore_df.set_index(explore_df.columns[0], inplace=True)
explore_df.head()

| | Latitude | Longitude | VenueId | VenueName | VenueLatitude | VenueLongitude | VenueCategory |
|---|---|---|---|---|---|---|---|
| 0 | 55.657483 | 37.652892 | 513f62a1e4b01b1784408764 | Каширское шоссе 24-22 | 55.657847 | 37.654895 | Hobby Shop |
| 1 | 55.657483 | 37.652892 | 4efc5b830aafc58694d835b0 | Крошка Картошка | 55.655958 | 37.652947 | Fast Food Restaurant |
| 2 | 55.657483 | 37.652892 | 58b29c5501f07761981c5ff7 | Скульптура Императрицы Елизаветы Петровны | 55.65632 | 37.65581 | Sculpture Garden |
| 3 | 55.797029 | 37.54853 | 5389a1a9498e0f4310e92932 | Кальян Мир | 55.797362 | 37.548528 | Smoke Shop |
| 4 | 55.797029 | 37.54853 | 50095d5ee4b0ffb333041831 | Пражка | 55.797402 | 37.548507 | Brewery |


### Data understanding

In [23]:
heat_df = pd.DataFrame(explore_df[["Latitude", "Longitude"]].value_counts())
heat_df.columns = ["Total"]
heat_df

| Latitude | Longitude | Total |
|---|---|---|
| 55.753542 | 37.637273 | 50 |
| 55.761062 | 37.613418 | 41 |
| 55.761748 | 37.620107 | 38 |
| 55.759814 | 37.648517 | 37 |
| 55.763137 | 37.620788 | 32 |
| ... | ... | ... |
| 55.761420 | 37.463328 | 1 |
| 55.776103 | 37.765513 | 1 |
| 55.778499 | 37.780898 | 1 |
| 55.781578 | 37.773483 | 1 |


In [12]:
from folium import plugins
map_moscow = folium.Map(location=[latitude, longitude], zoom_start=9)

h_df = explore_df[["VenueLatitude", "VenueLongitude"]].values
map_moscow.add_child(plugins.HeatMap(h_df, radius=15))

In [13]:
map_moscow = folium.Map(location=[latitude, longitude], zoom_start=9)

h_df = grid_df[["Latitude", "Longitude"]].values
map_moscow.add_child(plugins.HeatMap(h_df, radius=15))

Methodology <a name="methodology"></a>
-------

We are going to prepare three datasets:

- The first is based on the distribution of groceries and other everyday shops and services. It should approximate the distribution of the population in Moscow.

- The second is based on the distribution of colleges and offices. It should approximate where people spend their working days.

- The third is based on the distribution of fitness clubs.

Then we run DBSCAN clustering on each dataset. The goal is to divide each dataset into two groups: points in high-density areas and points in low-density areas (outliers). To do this, we merge all clusters found by DBSCAN into the high-density group and treat the noise points as the low-density group.
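The two-group split can be sketched on synthetic coordinates. The points below are made up for illustration (a tight 5x5 grid plus three isolated points) and are not Foursquare data; the `eps` and `min_samples` values are likewise only illustrative. DBSCAN labels noise as `-1`, so every other label is merged into the dense group.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Synthetic (lat, lng) pairs, not Foursquare data
lat0, lng0 = 55.75, 37.62
grid = [(lat0 + 0.005 * i, lng0 + 0.005 * j) for i in range(5) for j in range(5)]
isolated = [(55.50, 37.30), (55.95, 37.95), (55.60, 37.90)]
coords = np.array(grid + isolated)

# eps is measured in degrees here because the inputs are raw coordinates
labels = DBSCAN(eps=0.03, min_samples=5).fit(coords).labels_

# Merge every cluster into one "dense" group; label -1 marks noise (outliers)
is_dense = labels != -1
print(int(is_dense.sum()), 'dense points,', int((~is_dense).sum()), 'outliers')
```

Collapsing the cluster labels into a single boolean keeps the later comparison between datasets simple: each point is either in a dense area or not.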

In the next step, we compare the density groups of population and fitness clubs. Where the population density is high and the density of fitness clubs is low, it is a good place to open a new fitness club for families.

Then we compare the density groups of offices and fitness clubs in the same way. This gives the locations where a new fitness club for office workers makes sense.
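The comparison can be sketched as a boolean filter over per-point flags. The frame below is a toy example, not the notebook's results; it reuses the column names `OverPopulation` and `OverGyms` that the analysis builds later, with 1 for a dense cluster, 0 for DBSCAN noise, and NaN where no venue of that kind was found near the point.

```python
import numpy as np
import pandas as pd

# Toy flags, not real results: 1 = dense cluster, 0 = noise, NaN = no venues found
toy = pd.DataFrame(
    {
        'OverPopulation': [1, 1, 0, 1, np.nan],
        'OverGyms': [0, 1, 0, np.nan, 0],
    },
    index=['A', 'B', 'C', 'D', 'E'],
)

# Candidate for a family gym: dense population, sparse gyms
candidates = toy[(toy['OverGyms'] == 0) & (toy['OverPopulation'] == 1)]
print(candidates.index.tolist())  # ['A']
```

Note that point `D`, which has dense population but no gym flag at all, is not selected: comparing NaN with 0 is false. With this kind of filter, only points where at least one gym was found (and classified as noise) can become candidates.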

In the last step, we use reverse geocoding to convert the selected geolocations into street addresses, which gives us the addresses of places that lack gyms.

### Data preparation

In [29]:
population_df = explore_df[explore_df['VenueCategory'].isin( 
    ['Baby Store', 'Candy Store', 'Chocolate Shop', 'Cosmetics Shop',  'Grocery Store', 'Discount Store',
     'Accessories Store','Boutique','Kids Store','Lingerie Store',"Men's Store",'Shoe Store',
     'Dive Shop','Drugstore','Dry Cleaner','Fabric Shop','Fireworks Store','Fishing Store','Floating Market',
     'Flower Shop','Food & Drink Shop','Frame Store','Fruit & Vegetable Store','Furniture / Home Store',
     'Gaming Cafe','Garden Center','Hardware Store','Health & Beauty Service','Herbs & Spices Store',
     'Hobby Shop','Home Service','Kitchen Supply Store','Knitting Store','Laundry Service',
     'Locksmith','Lottery Retailer','Market','Mattress Store','Miscellaneous Shop',
     'Music Store','Lighting Store','Daycare',"Women's Store",'Beer Store','Butcher','Cheese Shop',
     'Coffee Roaster','Dairy Store','Farmers Market','Fish Market','Food Service','Gourmet Shop',
     'Grocery Store','Health Food Store','Imported Food Shop','Liquor Store','Organic Grocery',
     'Sausage Shop','Supermarket','Wine Shop','Nail Salon','Notary','Optical Shop','Other Repair Shop',
     'Outdoor Supply Store','Pawn Shop','Perfume Shop','Pet Service','Pet Store','Pharmacy','Photography Lab',
     'Photography Studio','Piercing Parlor','Pop-Up Shop','Public Bathroom','Salon / Barbershop',
     'Sauna / Steam Room','Shipping Store','Shoe Repair','Shopping Plaza','Skate Shop','Ski Shop',
     'Smoke Shop','Smoothie Shop','Spa','Sporting Goods Shop','Supplement Shop','Tailor Shop','Tanning Salon',
     'Thrift / Vintage Store','Toy / Game Store','Used Bookstore','Vape Store','Video Game Store',
     'Video Store','Warehouse Store','Watch Shop'])]

population_df.shape

(706, 7)

In [40]:
office_groups = ['College & University', 'Professional & Other Places']
offices_df = explore_df[explore_df['VenueCategory'].map(group_category).isin(office_groups)]
offices_df.shape

(22, 7)

In [39]:
gym_df = explore_df[explore_df['VenueCategory'] == 'Gym / Fitness Center']
gym_df.shape

(59, 7)

Analysis <a name="analysis"></a>
---

Run DBSCAN to find the areas with low population density.

In [101]:
from sklearn.cluster import DBSCAN 
epsilon = 0.03
minimumSamples = 27
db = DBSCAN(eps=epsilon, min_samples=minimumSamples).fit(population_df[["VenueLatitude", "VenueLongitude"]].values)
labels_population = db.labels_
print(set(labels_population))
len([x for x in labels_population if x == -1])

map_moscow = folium.Map(location=[latitude, longitude], zoom_start=9)

h_df = population_df[[x == -1 for x in labels_population]][["Latitude", "Longitude"]].values
map_moscow.add_child(plugins.HeatMap(h_df, radius=15))

{0, 1, 2, -1}
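Because DBSCAN is run here on raw latitude and longitude, `eps` is measured in degrees, and a degree of longitude is much shorter than a degree of latitude at Moscow's latitude. The following back-of-the-envelope conversion assumes a spherical Earth with a mean radius of 6371 km; the helper name `degrees_to_km` is hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation


def degrees_to_km(eps_deg, latitude_deg):
    """Convert an eps given in degrees to approximate north-south and east-west km."""
    km_per_deg_lat = 2 * math.pi * EARTH_RADIUS_KM / 360
    # Meridians converge toward the poles, so longitude degrees shrink by cos(latitude)
    km_per_deg_lng = km_per_deg_lat * math.cos(math.radians(latitude_deg))
    return eps_deg * km_per_deg_lat, eps_deg * km_per_deg_lng


north_south_km, east_west_km = degrees_to_km(0.03, 55.75)
print(f'{north_south_km:.2f} km north-south, {east_west_km:.2f} km east-west')
```

Under these assumptions, `eps = 0.03` corresponds to roughly 3.3 km north-south but less than 2 km east-west, so the neighborhoods are stretched along the meridian. If a uniform radius matters, one option is scikit-learn's `metric='haversine'` with coordinates converted to radians.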


Run DBSCAN to find the areas with a low concentration of offices and colleges.

In [102]:
epsilon = 0.10
minimumSamples = 6
db = DBSCAN(eps=epsilon, min_samples=minimumSamples).fit(offices_df[["VenueLatitude", "VenueLongitude"]].values)
labels_office = db.labels_
print(set(labels_office))
len([x for x in labels_office if x == -1])

map_moscow = folium.Map(location=[latitude, longitude], zoom_start=9)

h_df = offices_df[[x == -1 for x in labels_office]][["Latitude", "Longitude"]].values
map_moscow.add_child(plugins.HeatMap(h_df, radius=15))

{0, -1}


Run DBSCAN to find the areas with a low concentration of gyms.

In [103]:
epsilon = 0.06
minimumSamples = 7
db = DBSCAN(eps=epsilon, min_samples=minimumSamples).fit(gym_df[["VenueLatitude", "VenueLongitude"]].values)
labels_gym = db.labels_
print(set(labels_gym))
len([x for x in labels_gym if x == -1])

map_moscow = folium.Map(location=[latitude, longitude], zoom_start=9)

h_df = gym_df[[x == -1 for x in labels_gym]][["Latitude", "Longitude"]].values
map_moscow.add_child(plugins.HeatMap(h_df, radius=15))

{0, 1, -1}


In [109]:
result_df = grid_df.copy()
result_df['OverPopulation'] = np.nan
result_df['OverOffices'] = np.nan
result_df['OverGyms'] = np.nan

for i in range(len(population_df)):
    r = population_df.iloc[i]
    ovr = 1
    if labels_population[i] == -1:
        ovr = 0
    result_df.loc[(result_df['Latitude'] == r['Latitude']) & (result_df['Longitude'] == r['Longitude']), 'OverPopulation'] = ovr
    
for i in range(len(offices_df)):
    r = offices_df.iloc[i]
    ovr = 1
    if labels_office[i] == -1:
        ovr = 0
    result_df.loc[(result_df['Latitude'] == r['Latitude']) & (result_df['Longitude'] == r['Longitude']), 'OverOffices'] = ovr
    
for i in range(len(gym_df)):
    r = gym_df.iloc[i]
    ovr = 1
    if labels_gym[i] == -1:
        ovr = 0
    result_df.loc[(result_df['Latitude'] == r['Latitude']) & (result_df['Longitude'] == r['Longitude']), 'OverGyms'] = ovr
    
result_df

| | Latitude | Longitude | OverPopulation | OverOffices | OverGyms |
|---|---|---|---|---|---|
| 0 | 55.657483 | 37.652892 | 0.0 | NaN | NaN |
| 1 | 55.797029 | 37.548530 | 1.0 | 1.0 | NaN |
| 2 | 55.863532 | 37.369478 | NaN | NaN | NaN |
| 3 | 55.888999 | 37.676407 | 1.0 | NaN | NaN |
| 4 | 55.522089 | 37.551675 | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... |
| 695 | 55.790482 | 37.433513 | 0.0 | NaN | NaN |
| 696 | 55.761950 | 37.621878 | 1.0 | NaN | NaN |
| 697 | 55.831699 | 37.359530 | 0.0 | NaN | NaN |
| 698 | 55.708303 | 37.504880 | NaN | NaN | NaN |


Get the addresses of points that fall in a high-density population cluster but where gyms are sparse (DBSCAN noise). These places are candidates for family-oriented gyms.

In [118]:
for lat, lng in result_df[(result_df['OverGyms'] == 0) & (result_df['OverPopulation'] == 1)][['Latitude', 'Longitude']].values:
    coordinates = f'{lat}, {lng}'
    location = geolocator.reverse(coordinates)
    print(location.raw['display_name'])

Школа Свиблово, корпус 6, улица Седова, Свиблово, район Свиблово, Москва, Центральный федеральный округ, 129323, Россия
46 к2, Новочеркасский бульвар, район Марьино, Москва, Центральный федеральный округ, 109369, Россия
9, Тайнинская улица, Лосиноостровский, Лосиноостровский район, Москва, Центральный федеральный округ, 129345, Россия
16 к2, Тайнинская улица, Лосиноостровский, Лосиноостровский район, Москва, Центральный федеральный округ, 129345, Россия
2044, Люблинская улица (дублёр), Марьинский парк, район Марьино, Москва, Центральный федеральный округ, 109369, Россия


Get the addresses of points that fall in a high-density cluster of offices and colleges but where gyms are sparse (DBSCAN noise). These places are candidates for gyms aimed at students and office workers.

In [119]:
for lat, lng in result_df[(result_df['OverGyms'] == 0) & (result_df['OverOffices'] == 1)][['Latitude', 'Longitude']].values:
    coordinates = f'{lat}, {lng}'
    location = geolocator.reverse(coordinates)
    print(location.raw['display_name'])

46 к2, Новочеркасский бульвар, район Марьино, Москва, Центральный федеральный округ, 109369, Россия


Results and discussion <a name="results"></a>
-------

Our analysis shows that there are areas of Moscow with a low density of gyms. The lowest concentration is on the outskirts of the city.

These areas roughly correspond to, but do not coincide with, the areas of low population density.

As a result, five locations were found where residents lack access to gyms.

One location with a high concentration of offices and colleges but few gyms was also found; it appears suitable for a gym aimed at office workers.

Conclusion <a name="conclusion"></a>
-----------

The goal of this project was to identify areas of Moscow that are promising locations for gyms, in order to help stakeholders narrow down their search for the optimal location for a new gym. The distributions of gym density, office density, and population density were estimated from Foursquare data. These locations were then clustered to identify the main areas of interest, and addresses in these areas were generated to serve as starting points for the stakeholders' final research.

The final decision on the optimal gym location will be made by stakeholders based on the specific characteristics of the neighborhoods and locations in each recommended area, taking into account additional factors such as the attractiveness of each location, prices, etc.