# Capstone Project - The Battle of the Neighborhoods (Week 2)
### Applied Data Science Capstone by IBM/Coursera

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)



## Introduction: Business Problem <a name="introduction"></a>

Our company decided to provide free daily coffee to the employees working at each of its Distribution Centers located throughout the US. In this project we will try to find an optimal coffee shop location for each individual Distribution Center (hereafter referred to as DC).

We will use data science analysis to find the coffee shops closest to each DC so that the best possible final location can be chosen by stakeholders.

## Data <a name="data"></a>

Based on the definition of our problem, the factors that will influence our decision are:
* number of places that sell coffee in the neighborhood 
* distance to coffee places from the DC address

The following data sources will be needed to extract/generate the required information:
* venues that sell coffee, and their locations, in every neighborhood will be obtained using the **Foursquare API**
* the coordinates of each DC will be obtained using **Geocoding**

In [1]:
#import libraries
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (requires pandas >= 1.0; the pandas.io.json import path is deprecated)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans



In [2]:
#!conda install -c conda-forge folium
#!pip install folium==0.5.0
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


### Get data that was loaded in Watson Studio Assets

In [3]:
# The code was removed by Watson Studio for sharing.

In [4]:
df_data_1 = pd.read_csv(body)
df_data_1.head()

Unnamed: 0,DC,City,Street,Zip
0,VHL,Vernon Hills,Milwaukee Ave,60061
1,BUR,Burlington,River Rd,8016
2,ORL,Orlando,Cypress Lake Drive,32837
3,CRL,Carrollton,Trade Center Drive,75261
4,FNT,Fontana,Cabernet Drive,92337
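The Burlington ZIP code appears as `8016`, most likely because pandas parsed the `Zip` column as an integer and dropped a leading zero. The following is a minimal sketch of one way to avoid that, assuming the CSV has the columns shown above. The inline sample text is a hypothetical stand-in for the Watson Studio `body` object, and the `08016` value is an assumption about the original file.

```python
import io
import pandas as pd

# Hypothetical inline sample standing in for the Watson Studio `body` object
sample_csv = io.StringIO(
    "DC,City,Street,Zip\n"
    "VHL,Vernon Hills,Milwaukee Ave,60061\n"
    "BUR,Burlington,River Rd,08016\n"
)

# Reading Zip as text keeps leading zeros instead of converting to an integer
df_sample = pd.read_csv(sample_csv, dtype={"Zip": str})
print(df_sample)
```

Keeping the ZIP code as text may also help the geocoder, since the address string is built from this column.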


### The distribution centers are spread across the US, so the map must show the whole country

In [5]:
address = 'US'

geolocator = Nominatim(user_agent="us_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of the US are {}, {}.'.format(latitude, longitude))

The geographical coordinates of the US are 39.7837304, -100.4458825.


### Create a data frame from original and add latitude and longitude

In [6]:
# define the dataframe columns
column_names = ['DC', 'City', 'Street', 'Latitude', 'Longitude'] 

# instantiate the dataframe
dcs = pd.DataFrame(columns=column_names)

In [7]:
lat = []
long = []

# create the geocoder once and reuse it for every address
geolocator = Nominatim(user_agent="us_explorer")

for index_label, row_series in df_data_1.iterrows():
    # read the address parts by column name rather than by position
    city = row_series['City']
    street = row_series['Street']
    postalcode = row_series['Zip']

    address = '{}, {}, {}'.format(street, city, postalcode)
    location = geolocator.geocode(address)
    lat.append(location.latitude)
    long.append(location.longitude)

df_data_1['Latitude'] = lat
df_data_1['Longitude'] = long
df_data_1

Unnamed: 0,DC,City,Street,Zip,Latitude,Longitude
0,VHL,Vernon Hills,Milwaukee Ave,60061,42.253698,-87.946924
1,BUR,Burlington,River Rd,8016,39.60446,-74.547454
2,ORL,Orlando,Cypress Lake Drive,32837,28.413378,-81.397819
3,CRL,Carrollton,Trade Center Drive,75261,32.990442,-96.93377
4,FNT,Fontana,Cabernet Drive,92337,34.037067,-117.509472
5,SEA,Sumner,32nd Street E,98390,47.228528,-122.254392


### Create US map with all DCs

In [8]:
# create map of the US using latitude and longitude values
map_us= folium.Map(location=[latitude, longitude], zoom_start=5)

# add markers to map
for lat, lng, city, dc in zip(df_data_1['Latitude'], df_data_1['Longitude'], df_data_1['City'], df_data_1['DC']):
    label = '{}, {}'.format(dc, city)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=True).add_to(map_us)  
    
map_us

### Define limit and radius for all locations

In [9]:
LIMIT = 10
radius= 5000

category='4bf58dd8d48988d1e0931735'

In [10]:
# The code was removed by Watson Studio for sharing.

In [11]:
def getURL(latitude, longitude):
    # build the explore request for one location
    # note: the `category` id defined above is not part of this request
    return 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
        CLIENT_ID,
        CLIENT_SECRET,
        VERSION,
        latitude,
        longitude,
        radius,
        LIMIT)
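The `category` id is defined but never sent to Foursquare, which is why the results later include grocery stores and parks. The sketch below shows one way to restrict a request to a category. It assumes the v2 `venues/search` endpoint and its `categoryId` parameter rather than the `explore` endpoint used in this notebook. The function name `build_search_url` and the credential and version values are placeholders, not the author's.

```python
from urllib.parse import urlencode


def build_search_url(latitude, longitude, client_id, client_secret, version,
                     radius=5000, limit=10, category_id=None):
    """Build (but do not send) a Foursquare v2 venues/search URL."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(latitude, longitude),
        "radius": radius,
        "limit": limit,
    }
    # only add the category filter when one is requested
    if category_id is not None:
        params["categoryId"] = category_id
    return "https://api.foursquare.com/v2/venues/search?" + urlencode(params)


# Placeholder credentials and version string, for illustration only
url = build_search_url(42.253698, -87.946924, "ID", "SECRET", "20180605",
                       category_id="4bf58dd8d48988d1e0931735")
print(url)
```

Letting `urlencode` assemble the query string avoids hand-formatting mistakes. The response layout of `venues/search` is expected to differ from `explore`, so the parsing cells would need adjusting before using it.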


In [12]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [13]:
# test the first DC (row 0, VHL)

url = getURL(df_data_1.loc[0, 'Latitude'], df_data_1.loc[0, 'Longitude'])
results = requests.get(url).json()
    
results

{'meta': {'code': 200, 'requestId': '5f14e52dd132907d7c508fc5'},
 'response': {'suggestedFilters': {'header': 'Tap to show:',
   'filters': [{'name': 'Open now', 'key': 'openNow'},
    {'name': '$-$$$$', 'key': 'price'}]},
  'headerLocation': 'Libertyville',
  'headerFullLocation': 'Libertyville',
  'headerLocationGranularity': 'city',
  'totalResults': 98,
  'suggestedBounds': {'ne': {'lat': 42.29869794500004,
    'lng': -87.88624059788772},
   'sw': {'lat': 42.208697854999954, 'lng': -88.00760660211229}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '52fbea09498ee822c9841e2b',
       'name': "Trader Joe's",
       'location': {'address': '1600 S Milwaukee Ave',
        'lat': 42.25516054279883,
        'lng': -87.94690585327373,
        'labeledLatLngs': [{'label': '

In [14]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Trader Joe's,Grocery Store,42.255161,-87.946906
1,Starbucks,Coffee Shop,42.255821,-87.948269
2,Lou Malnati's Pizza,Pizza Place,42.260113,-87.947793
3,Mariano's Fresh Market,Grocery Store,42.256905,-87.950948
4,Lazy Dog Restaurant & Bar,Restaurant,42.245165,-87.944632
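The dotted column names used in `filtered_columns` come from how `json_normalize` flattens nested dictionaries. Here is a small sketch of that step on a single hand-built item, shaped like the `items` entries in the response above and reusing values from the table. It assumes pandas 1.0 or later, where `pd.json_normalize` is available.

```python
import pandas as pd

# One item shaped like the entries in results['response']['groups'][0]['items']
items = [{
    'venue': {
        'name': 'Starbucks',
        'categories': [{'name': 'Coffee Shop'}],
        'location': {'lat': 42.255821, 'lng': -87.948269},
    }
}]

# Nested keys are joined with dots; lists are kept as-is in a single column
flat = pd.json_normalize(items)
print(flat.columns.tolist())

# The first category name is what get_category_type extracts
print(flat.loc[0, 'venue.categories'][0]['name'])
```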


### Create a new dataframe with all DCs and venues

In [15]:
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
frames = []  # one dataframe of venues per DC

for lat, lng, dc in zip(df_data_1['Latitude'].astype('float64'), df_data_1['Longitude'].astype('float64'), df_data_1['DC']):
    url = getURL(lat, lng)
    results = requests.get(url).json()
    venues = results['response']['groups'][0]['items']

    nearby_venues = json_normalize(venues) # flatten JSON

    # filter columns
    nearby_venues = nearby_venues.loc[:, filtered_columns]

    # filter the category for each row
    nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

    # clean columns
    nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

    # tag every venue with the DC it belongs to
    nearby_venues['DC'] = dc
    frames.append(nearby_venues)

# DataFrame.append was removed in pandas 2.0, so concatenate once at the end
df_dcs = pd.concat(frames)

df_dcs

Unnamed: 0,name,categories,lat,lng,DC
0,Trader Joe's,Grocery Store,42.255161,-87.946906,VHL
1,Starbucks,Coffee Shop,42.255821,-87.948269,VHL
2,Lou Malnati's Pizza,Pizza Place,42.260113,-87.947793,VHL
3,Mariano's Fresh Market,Grocery Store,42.256905,-87.950948,VHL
4,Lazy Dog Restaurant & Bar,Restaurant,42.245165,-87.944632,VHL
5,AMC Hawthorn 12,Movie Theater,42.243885,-87.948936,VHL
6,Burt's Deli,Deli / Bodega,42.263912,-87.950814,VHL
7,Century Park,Park,42.248269,-87.961159,VHL
8,Pho House: Authentic Vietnamese And Asian Cuisine,Vietnamese Restaurant,42.26439,-87.950893,VHL
9,Portillo's,Hot Dog Joint,42.239861,-87.958918,VHL


### Joining the main DC data with the venues 

In [16]:
df_final = pd.merge(df_data_1,df_dcs, on="DC", how="inner")
df_final.head(10)

Unnamed: 0,DC,City,Street,Zip,Latitude,Longitude,name,categories,lat,lng
0,VHL,Vernon Hills,Milwaukee Ave,60061,42.253698,-87.946924,Trader Joe's,Grocery Store,42.255161,-87.946906
1,VHL,Vernon Hills,Milwaukee Ave,60061,42.253698,-87.946924,Starbucks,Coffee Shop,42.255821,-87.948269
2,VHL,Vernon Hills,Milwaukee Ave,60061,42.253698,-87.946924,Lou Malnati's Pizza,Pizza Place,42.260113,-87.947793
3,VHL,Vernon Hills,Milwaukee Ave,60061,42.253698,-87.946924,Mariano's Fresh Market,Grocery Store,42.256905,-87.950948
4,VHL,Vernon Hills,Milwaukee Ave,60061,42.253698,-87.946924,Lazy Dog Restaurant & Bar,Restaurant,42.245165,-87.944632
5,VHL,Vernon Hills,Milwaukee Ave,60061,42.253698,-87.946924,AMC Hawthorn 12,Movie Theater,42.243885,-87.948936
6,VHL,Vernon Hills,Milwaukee Ave,60061,42.253698,-87.946924,Burt's Deli,Deli / Bodega,42.263912,-87.950814
7,VHL,Vernon Hills,Milwaukee Ave,60061,42.253698,-87.946924,Century Park,Park,42.248269,-87.961159
8,VHL,Vernon Hills,Milwaukee Ave,60061,42.253698,-87.946924,Pho House: Authentic Vietnamese And Asian Cuisine,Vietnamese Restaurant,42.26439,-87.950893
9,VHL,Vernon Hills,Milwaukee Ave,60061,42.253698,-87.946924,Portillo's,Hot Dog Joint,42.239861,-87.958918


## Methodology <a name="methodology"></a>

In this project we direct our efforts at detecting venues around each DC.

In the first step we collected the required **data: distribution centers with all venues within a specific radius**.

The second step in our analysis is to filter out places that wouldn't sell coffee (e.g. a pizza place). To build the list of categories to exclude, we first need the distinct list of venue categories.

In the third and final step we focus on the most promising areas and within those create **clusters of locations that meet some basic requirements** established in discussion with stakeholders. We present a map of all such locations and also create clusters (using **k-means clustering**) of those locations to identify general zones / neighborhoods / addresses that should be a starting point for the final 'street level' exploration and search for the optimal venue location by stakeholders.

## Analysis <a name="analysis"></a>

Let's perform some basic exploratory data analysis and derive some additional info from our raw data.

### Find all distinct categories

In [17]:
cat = df_final['categories'].unique()

print('Distinct categories', cat)



Distinct categories ['Grocery Store' 'Coffee Shop' 'Pizza Place' 'Restaurant' 'Movie Theater'
 'Deli / Bodega' 'Park' 'Vietnamese Restaurant' 'Hot Dog Joint' 'Pub'
 'Harbor / Marina' 'Beach' 'Brewery' 'River' 'Campground' 'Burger Joint'
 'Indian Restaurant' 'Ice Cream Shop' 'Go Kart Track'
 'Latin American Restaurant' 'Convenience Store'
 'Middle Eastern Restaurant' 'Bar' 'Supermarket' 'Mexican Restaurant'
 'Fast Food Restaurant' 'Bakery' 'Sushi Restaurant' 'Korean Restaurant'
 'Café' 'Chinese Restaurant' 'American Restaurant' 'Winery'
 'Sandwich Place' 'Tea Room' 'Donut Shop' 'Italian Restaurant']


### Remove categories that are not selling coffee

In [18]:
df_final = df_final.loc[~df_final['categories'].isin(['Harbor / Marina', 'Movie Theater','Go Kart Track','Pizza Place','Grocery Store','Park'])]
cat = df_final['categories'].unique()

print('Distinct categories', cat)


Distinct categories ['Coffee Shop' 'Restaurant' 'Deli / Bodega' 'Vietnamese Restaurant'
 'Hot Dog Joint' 'Pub' 'Beach' 'Brewery' 'River' 'Campground'
 'Burger Joint' 'Indian Restaurant' 'Ice Cream Shop'
 'Latin American Restaurant' 'Convenience Store'
 'Middle Eastern Restaurant' 'Bar' 'Supermarket' 'Mexican Restaurant'
 'Fast Food Restaurant' 'Bakery' 'Sushi Restaurant' 'Korean Restaurant'
 'Café' 'Chinese Restaurant' 'American Restaurant' 'Winery'
 'Sandwich Place' 'Tea Room' 'Donut Shop' 'Italian Restaurant']
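The exclusion list above still leaves categories such as 'Beach', 'River' and 'Campground' in the data. An alternative is to keep only an explicit allow-list of categories. The sketch below illustrates that idea on a small stand-in for `df_final`. The contents of `COFFEE_CATEGORIES` are an assumption about which categories sell coffee and would need to be agreed with stakeholders.

```python
import pandas as pd

# Hypothetical allow-list: categories assumed likely to sell coffee
COFFEE_CATEGORIES = ['Coffee Shop', 'Café', 'Bakery', 'Donut Shop', 'Tea Room']

# Small stand-in for df_final, using venues from the VHL table above
sample = pd.DataFrame({
    'name': ['Starbucks', 'Century Park', "Burt's Deli", "Portillo's"],
    'categories': ['Coffee Shop', 'Park', 'Deli / Bodega', 'Hot Dog Joint'],
})

# Keep only rows whose category is on the allow-list
coffee_only = sample.loc[sample['categories'].isin(COFFEE_CATEGORIES)]
print(coffee_only)
```

An allow-list fails closed: a category that has not been reviewed is dropped rather than silently kept.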


### Calculate the distance from each DC to the venues found

In [19]:
import math 

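def calc_xy_distance(x1, y1, x2, y2):
    # straight-line distance in degrees of latitude/longitude (not kilometres),
    # so it is only a rough proxy for ground distance
    dx = x2 - x1
    dy = y2 - y1
    return math.sqrt(dx*dx + dy*dy)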

df_final['Distance']  = df_final.apply(lambda x: calc_xy_distance(x['Latitude'],x['Longitude'],x['lat'],x['lng']),axis=1)
df_final.head()

Unnamed: 0,DC,City,Street,Zip,Latitude,Longitude,name,categories,lat,lng,Distance
1,VHL,Vernon Hills,Milwaukee Ave,60061,42.253698,-87.946924,Starbucks,Coffee Shop,42.255821,-87.948269,0.002514
4,VHL,Vernon Hills,Milwaukee Ave,60061,42.253698,-87.946924,Lazy Dog Restaurant & Bar,Restaurant,42.245165,-87.944632,0.008835
6,VHL,Vernon Hills,Milwaukee Ave,60061,42.253698,-87.946924,Burt's Deli,Deli / Bodega,42.263912,-87.950814,0.01093
8,VHL,Vernon Hills,Milwaukee Ave,60061,42.253698,-87.946924,Pho House: Authentic Vietnamese And Asian Cuisine,Vietnamese Restaurant,42.26439,-87.950893,0.011405
9,VHL,Vernon Hills,Milwaukee Ave,60061,42.253698,-87.946924,Portillo's,Hot Dog Joint,42.239861,-87.958918,0.018312
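The `Distance` column above is measured in degrees, which is hard to interpret and treats a degree of longitude as equal to a degree of latitude. The sketch below uses the haversine formula to get an approximate great-circle distance in kilometres instead. The function name `haversine_km` and the mean Earth radius of 6371 km are choices made for this illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Distance from the VHL distribution center to the nearby Starbucks
# (coordinates taken from the table above)
print(haversine_km(42.253698, -87.946924, 42.255821, -87.948269))
```

It could replace `calc_xy_distance` in the `apply` call with the same four arguments, and the result could then be compared directly with the 5000 m search radius.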


### Let's print each DC map with the locations found

In [20]:
# add markers to map
from IPython.display import display
for lat, lng, city, dc in zip(df_data_1['Latitude'], df_data_1['Longitude'], df_data_1['City'], df_data_1['DC']):
    dcmap= folium.Map(location=[lat, lng], zoom_start=15)
    label = '{}, {}'.format(dc, city)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=20,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(dcmap)
    
    venues_dc = df_final[df_final['DC']==dc]
    # add one red marker per venue to this DC's map
    for lat1, lng1, name in zip(venues_dc['lat'], venues_dc['lng'], venues_dc['name']):
        label1 = '{}'.format(name)
        label1 = folium.Popup(label1, parse_html=True)
        folium.CircleMarker(
            [lat1, lng1],
            radius=10,
            popup=label1,
            color='red',
            fill=True,
            fill_color='red',
            fill_opacity=0.9,
            parse_html=False).add_to(dcmap)
    display(dcmap)
    


### Let's generate clusters of the venues around each DC and map them

In [22]:
from sklearn.cluster import KMeans


# set number of clusters
kclusters = 5

for index in range(len(df_data_1)):

    # copy so that adding a column does not modify a slice of df_final
    venue_clustering = df_final[df_final['DC']==df_data_1.loc[index,"DC"]].copy()
    if venue_clustering.empty:
        continue  # no venues left for this DC after filtering

    # cluster on the venue coordinates; the DC coordinates are constant within a DC
    venue_grouped = venue_clustering[["lat","lng"]]

    # run k-means clustering (k cannot exceed the number of venues)
    k = min(kclusters, len(venue_grouped))
    kmeans = KMeans(n_clusters=k, random_state=0).fit(venue_grouped)

    # add clustering labels
    venue_clustering.insert(0, 'Cluster Labels', kmeans.labels_)
    venue_merged = venue_clustering

    # create map
    latitude = df_data_1.loc[index,"Latitude"]
    longitude = df_data_1.loc[index,"Longitude"]
    map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

    # set color scheme for the clusters
    colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
    rainbow = [colors.rgb2hex(i) for i in colors_array]

    # add one marker per venue, placed at the venue's own coordinates
    for lat, lon, poi, cluster in zip(venue_merged['lat'], venue_merged['lng'], venue_merged['name'], venue_merged['Cluster Labels']):
        label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
        folium.CircleMarker(
            [lat, lon],
            radius=5,
            popup=label,
            color=rainbow[cluster-1],
            fill=True,
            fill_color=rainbow[cluster-1],
            fill_opacity=0.7).add_to(map_clusters)

    display(map_clusters)

## Results and Discussion <a name="results"></a>

Our analysis provides the stakeholders with a list of venues near each DC that can provide coffee, along with each venue's distance from the DC.
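To turn that list into a shortlist, the venues can be ranked by distance within each DC. The sketch below does this on a small stand-in for `df_final` built from three VHL rows shown earlier. Keeping the two closest venues is an arbitrary choice for illustration.

```python
import pandas as pd

# Stand-in for df_final, using three VHL rows and their Distance values
sample = pd.DataFrame({
    'DC': ['VHL', 'VHL', 'VHL'],
    'name': ["Burt's Deli", 'Starbucks', 'Lazy Dog Restaurant & Bar'],
    'Distance': [0.010930, 0.002514, 0.008835],
})

# Sort by distance within each DC and keep the two closest venues per DC
closest = (
    sample.sort_values(['DC', 'Distance'])
          .groupby('DC')
          .head(2)
          .reset_index(drop=True)
)
print(closest)
```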

## Conclusion <a name="conclusion"></a>

The purpose of this project was to identify places that sell coffee in order to aid stakeholders in narrowing down the search for an optimal location. We first retrieved all venues from Foursquare for each DC, then filtered out places not selling coffee and calculated the distance to the DC. We mapped all remaining venues around the DCs so that stakeholders can make decisions.