# IBM Data Science Specialization - Final Project

This is the notebook for Darvesh Gorhe's capstone project for the IBM Data Science Specialization on Coursera.

# A. Introduction

### A.1 Problem Outline
For this project, I compare the coffee scene across Seattle's neighborhoods. The goal is to understand how coffee shops are distributed across the neighborhoods, to see which neighborhoods have high competition among coffee shops, and to suggest where a new coffee shop could open.

The analysis is aimed at two main audiences: people who want to open a coffee shop as a small business, and coffee chains (e.g. Starbucks, Peet's Coffee). I picked these two audiences because the same data and analysis can lead to different conclusions for each. For a large chain, opening a new storefront in an area with many coffee shops could mean competing with its own neighboring storefronts. For someone starting a small business, however, an already dense location may be the better choice, since the foot traffic is already there.

### A.2 Data Used
I will be using the following data for Seattle in my analysis:
* The total number of coffee shops
* The number of coffee shops per neighborhood
* The latitude and longitude of each coffee shop
* The address of each coffee shop

# B. Collecting Data

## B.1 Extracting Neighborhoods from a GeoJSON File

In [2]:
import json

# Load the GeoJSON file of Seattle neighborhood boundaries
with open('seattle.json', 'r') as f:
    seattle_geo = json.load(f)

# Collect the name stored in each feature's properties
seattle_neighborhoods = [feature['properties']['name']
                         for feature in seattle_geo['features']]
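The cell above relies on each GeoJSON feature carrying a `properties.name` entry. As a minimal sketch of that structure, the hand-written sample below stands in for `seattle.json` (its contents and the helper `neighborhood_names` are assumptions for illustration, not taken from the real file) and shows one way to skip features that lack a name instead of raising a `KeyError`.

```python
import json

# Hand-written stand-in for seattle.json; the real file's contents are not shown here.
sample_geojson = json.loads("""
{
  "type": "FeatureCollection",
  "features": [
    {"type": "Feature",
     "properties": {"name": "Ballard"},
     "geometry": {"type": "Polygon",
                  "coordinates": [[[-122.40, 47.66], [-122.36, 47.66],
                                   [-122.36, 47.69], [-122.40, 47.66]]]}},
    {"type": "Feature", "properties": {}, "geometry": null},
    {"type": "Feature", "properties": {"name": "Fremont"}, "geometry": null}
  ]
}
""")


def neighborhood_names(geo):
    """Return the 'name' property of every feature that has one (hypothetical helper)."""
    names = []
    for feature in geo.get("features", []):
        name = (feature.get("properties") or {}).get("name")
        if name:
            names.append(name)
    return names


names = neighborhood_names(sample_geojson)
print(names)
```

The same `properties.name` path is what `featureidkey='properties.name'` refers to when the choropleth maps are drawn later on.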

## B.2 Geocoding Neighborhoods

In [18]:
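# Geocoding Neighborhoods
import sys

import pandas as pd
import geopy
from geopy.exc import GeocoderTimedOut
from geopy.geocoders import Nominatim

# Allow up to 7 seconds per geocoding request before it times out
geopy.geocoders.options.default_timeout = 7

geolocator = Nominatim(user_agent='coursera_project_final')

df_cols = ['Neighborhood', 'Latitude', 'Longitude']
rows = []
exceptions = []  # neighborhoods whose lookup timed out and should be retried

for hood in seattle_neighborhoods:
    # Reset on every iteration so a failed lookup never reuses the previous result
    location = None
    try:
        location = geolocator.geocode(hood + ", Seattle")
    except GeocoderTimedOut:
        print("Could not retrieve location for %s, Seattle \n" % hood)
        exceptions.append(hood)

    # Missing coordinates are stored as None and may be filled in by the retry pass
    rows.append({"Neighborhood": hood,
                 "Latitude": location.latitude if location is not None else None,
                 "Longitude": location.longitude if location is not None else None})
    sys.stdout.write('|')

# DataFrame.append was removed in pandas 2.0, so build the frame in one step
df_coordinates = pd.DataFrame(rows, columns=df_cols)

# Retry the timed-out neighborhoods, for at most 10 passes
try_count = 1

while exceptions and try_count <= 10:
    still_failing = []
    for hood in exceptions:
        try:
            location = geolocator.geocode(hood + ", Seattle")
        except GeocoderTimedOut:
            print("Could not retrieve location for %s, Seattle" % hood)
            still_failing.append(hood)
            continue

        if location is not None:
            mask = df_coordinates['Neighborhood'] == hood
            df_coordinates.loc[mask, 'Latitude'] = location.latitude
            df_coordinates.loc[mask, 'Longitude'] = location.longitude
    # Only neighborhoods that timed out again are retried on the next pass
    exceptions = still_failing
    try_count += 1

# Drop neighborhoods that could not be geocoded so later API calls get valid coordinates
df_coordinates = df_coordinates.dropna(subset=['Latitude', 'Longitude']).reset_index(drop=True)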


In [19]:
df_coordinates.head()

| | Neighborhood | Latitude | Longitude |
|---|---|---|---|
| 0 | Loyal Heights | 47.688709 | -122.392907 |
| 1 | Adams | 47.565271 | -122.279546 |
| 2 | Whittier Heights | 47.683297 | -122.371449 |
| 3 | West Woodland | 47.675973 | -122.347499 |
| 4 | Sunset Hill | 47.675217 | -122.398448 |
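The geocoding cell retries lookups that time out. A minimal sketch of the same idea as a reusable helper is shown below, with a growing delay between attempts. The helper `call_with_retries`, the `FakeTimeout` exception, and the `flaky_geocode` stand-in are all hypothetical names introduced for illustration; the stand-in replaces the real geocoder so the sketch runs without network access.

```python
import time


def call_with_retries(func, *args, retries=3, base_delay=1.0,
                      exceptions=(Exception,), sleep=time.sleep):
    """Call func(*args); on a listed exception wait and try again.

    The delay doubles after each failed attempt. Returns None if every
    attempt fails. (Hypothetical helper, not part of geopy.)
    """
    for attempt in range(retries):
        try:
            return func(*args)
        except exceptions:
            if attempt == retries - 1:
                return None
            sleep(base_delay * 2 ** attempt)
    return None


class FakeTimeout(Exception):
    """Stand-in for a geocoder timeout error."""


calls = {"n": 0}


def flaky_geocode(query):
    """Fake geocoder that fails twice, then returns fixed coordinates."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise FakeTimeout(query)
    return (47.6062, -122.3321)


delays = []  # record the waits instead of actually sleeping
result = call_with_retries(flaky_geocode, "Ballard, Seattle",
                           retries=5, base_delay=0.5,
                           exceptions=(FakeTimeout,), sleep=delays.append)
print(result, delays)
```

In the notebook, `func` would be `geolocator.geocode` and `exceptions` would be `(GeocoderTimedOut,)`. Spacing out the attempts is also gentler on a shared public geocoding service than retrying immediately.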


## B.3 Getting Venues in Neighborhoods

In [21]:
from configparser import ConfigParser

# Read the Foursquare credentials from a local config file kept out of the notebook
config = ConfigParser()
config_path = 'config.cfg'
config.read(config_path)

CLIENT_ID = config.get('client_tokens', 'client_id')
CLIENT_SECRET = config.get('client_tokens', 'client_secret')

In [23]:
import requests

VERSION = '20200326'
category_id = '4bf58dd8d48988d1e0931735'  # coffee shop category id
radius = 500  # search radius in meters around each neighborhood's coordinates
LIMIT = 1000

url = 'https://api.foursquare.com/v2/venues/explore'
coffee_list = []

for index, row in df_coordinates.iterrows():
    # Let requests build and URL-encode the query string
    params = {'query': 'coffee',
              'client_id': CLIENT_ID,
              'client_secret': CLIENT_SECRET,
              'll': '{},{}'.format(row['Latitude'], row['Longitude']),
              'v': VERSION,
              'categoryId': category_id,
              'radius': radius,
              'limit': LIMIT}

    results = requests.get(url, params=params).json()['response']['groups'][0]['items']

    # One row per venue returned for this neighborhood
    coffee_list.extend([(
        row['Neighborhood'],
        row['Latitude'],
        row['Longitude'],
        v['venue']['name'],
        v['venue']['location']['lat'],
        v['venue']['location']['lng'],
        v['venue']['categories'][0]['name']) for v in results])

    sys.stdout.write('|')

print('Done with API calls')

coffee_shops = pd.DataFrame(coffee_list,
                            columns=['Neighborhood',
                                     'Neighborhood Latitude',
                                     'Neighborhood Longitude',
                                     'Venue',
                                     'Venue Latitude',
                                     'Venue Longitude',
                                     'Venue Category'])

Done with API calls


In [24]:
len(coffee_shops)

1461
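The venue loop in B.3 indexes straight into `response -> groups[0] -> items -> venue`, so a reply with no groups or a venue with no category would raise an exception. The sketch below shows a more defensive way to flatten such a payload. The payload is a hand-written mock that only mirrors the keys the notebook reads, not a real API response, and `extract_venues` is a hypothetical helper name.

```python
def extract_venues(payload, neighborhood):
    """Flatten an explore-style payload into (neighborhood, name, lat, lng, category) rows."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item.get("venue", {})
        location = venue.get("location", {})
        categories = venue.get("categories") or [{}]  # tolerate a missing or empty list
        rows.append((neighborhood,
                     venue.get("name"),
                     location.get("lat"),
                     location.get("lng"),
                     categories[0].get("name")))
    return rows


# Mock payload with made-up values, shaped like the fields the notebook accesses
sample_payload = {
    "response": {
        "groups": [
            {"items": [
                {"venue": {"name": "Example Cafe",
                           "location": {"lat": 47.58, "lng": -122.40},
                           "categories": [{"name": "Coffee Shop"}]}},
                {"venue": {"name": "No Category Cafe",
                           "location": {"lat": 47.59, "lng": -122.41},
                           "categories": []}},
            ]}
        ]
    }
}

rows = extract_venues(sample_payload, "Alki")
empty = extract_venues({"response": {}}, "Alki")
print(rows)
print(empty)
```

Rows with a `None` category can then be filtered or inspected separately rather than stopping the whole collection run.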

# C. Methodology

## C.1 Total Coffee Shops per Neighborhood

In [40]:
# Grouping larger dataframe by neighborhoods and counting venues
total_count = coffee_shops.groupby(['Neighborhood']).count()
total_count = total_count[['Venue']]
total_count.reset_index(inplace=True)

total_count.columns = ['name', 'Coffee Shops']

total_count.head()

| | name | Coffee Shops |
|---|---|---|
| 0 | Algona | 1 |
| 1 | Alki | 5 |
| 2 | Atlantic | 1 |
| 3 | Auburn | 4 |
| 4 | Beaux Arts | 4 |


In [61]:
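import json

import plotly.express as px

# Load the GeoJSON of Seattle neighborhoods as a dict so the boundaries are
# embedded in the figure (and in the exported HTML) rather than referenced by path
with open('./seattle.json', 'r') as f:
    seattle_geo = json.load(f)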

seattle_lat = 47.6062
seattle_lon = -122.3321

fig = px.choropleth_mapbox(total_count, 
                    geojson=seattle_geo,
                    color_continuous_scale='Viridis',
                    range_color=(0,50),
                    locations='name',
                    center = {'lat':seattle_lat, 'lon':seattle_lon},
                    mapbox_style='carto-positron',
                    featureidkey='properties.name', 
                    color='Coffee Shops')



fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})

fig.show()

fig.write_html('./total_count.html')

## C.2 Unique Coffee Shops per Neighborhood

In [66]:
unique_count = coffee_shops.groupby('Neighborhood')['Venue'].nunique()
unique_count = unique_count.to_frame()
unique_count.reset_index(inplace=True)

unique_count.columns = ['name', 'Unique Coffee Shops']
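The difference between the total count in C.1 and the unique count in C.2 comes down to `count` versus `nunique` on the `Venue` column: `count` tallies every row, while `nunique` tallies distinct venue names, so several branches of one chain in a neighborhood collapse into a single entry. A minimal sketch on made-up data (the neighborhoods and venues below are illustrative only):

```python
import pandas as pd

# Toy data: neighborhood "A" has two rows for the same chain name
toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B"],
    "Venue": ["Starbucks", "Starbucks", "Indie Roasters", "Starbucks", "Corner Cafe"],
})

# Compute both measures side by side with named aggregation
summary = (toy.groupby("Neighborhood")["Venue"]
              .agg(total="count", unique="nunique")
              .reset_index())

print(summary)
```

A large gap between the two columns suggests a neighborhood dominated by repeated names, which is the situation a chain would want to avoid and an independent shop might read differently.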

In [68]:
fig_unique = px.choropleth_mapbox(unique_count, 
                    geojson=seattle_geo,
                    color_continuous_scale='Viridis',
                    range_color=(0,50),
                    locations='name',
                    center = {'lat':seattle_lat, 'lon':seattle_lon},
                    mapbox_style='carto-positron',
                    featureidkey='properties.name', 
                    color='Unique Coffee Shops')



fig_unique.update_layout(margin={"r":0,"t":0,"l":0,"b":0})

fig_unique.show()

## C.3 Top 10 Neighborhoods by Total Number of Coffee Shops

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

# total_count has the columns 'name' and 'Coffee Shops' (renamed in C.1)
top10 = total_count.sort_values('Coffee Shops', ascending=False)
top10 = top10.iloc[0:10, :]

fig = plt.figure(figsize=(14, 10))

plt.bar(height=top10['Coffee Shops'], x=top10['name'], align='edge')
plt.xticks(rotation='vertical')
plt.xlabel('Neighborhoods in Seattle')
plt.ylabel('Number of Coffee Shops')
plt.show()

## C.4 Top 10 Neighborhoods by Unique Number of Coffee Shops

In [None]:
# unique_count has the columns 'name' and 'Unique Coffee Shops' (renamed in C.2)
top10_unique = unique_count.sort_values('Unique Coffee Shops', ascending=False)
top10_unique = top10_unique.iloc[0:10].reset_index(drop=True)

fig = plt.figure(figsize=(14, 10))
plt.bar(height=top10_unique['Unique Coffee Shops'], x=top10_unique['name'], align='edge')

plt.xticks(rotation='vertical')
plt.xlabel('Neighborhoods in Seattle')
plt.ylabel('Number of Unique Coffee Shops')
plt.show()

In [None]:
import shapely.geometry
import pyproj
import math

# Seattle lies in UTM zone 10N (EPSG:32610); WGS84 latitude/longitude is EPSG:4326.
# pyproj.transform is deprecated, so Transformer objects are used instead.
# always_xy=True fixes the axis order to (lon, lat) and (x, y).
to_xy = pyproj.Transformer.from_crs('EPSG:4326', 'EPSG:32610', always_xy=True)
to_latlon = pyproj.Transformer.from_crs('EPSG:32610', 'EPSG:4326', always_xy=True)

def latlon_to_xy(lat, lon):
    """Convert WGS84 latitude/longitude to UTM x/y in meters."""
    x, y = to_xy.transform(lon, lat)
    return x, y

def xy_to_latlon(x, y):
    """Convert UTM x/y in meters back to WGS84. Note the return order is (lon, lat)."""
    lon, lat = to_latlon.transform(x, y)
    return lon, lat

def calc_xy_distance(x1, y1, x2, y2):
    dx = x2 - x1
    dy = y2 - y1
    return math.sqrt(dx*dx + dy*dy)

In [None]:
# Testing Coordinate System

x, y = latlon_to_xy(seattle_lat, seattle_lon)

print(x, y)
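The projection only gives meaningful meter distances when the UTM zone matches the area being mapped. As a small sketch, the zone can be derived from the longitude with the standard 6-degree rule, and the matching WGS84 / UTM EPSG code follows from the zone and hemisphere. The helper names `utm_zone` and `utm_epsg` are hypothetical, and the sketch ignores the special-case zones around Norway and Svalbard.

```python
import math


def utm_zone(lon):
    """Return the UTM zone number (1-60) for a longitude in degrees."""
    return int(math.floor((lon + 180.0) / 6.0)) % 60 + 1


def utm_epsg(lat, lon):
    """Return the EPSG code of the WGS84 / UTM zone covering (lat, lon)."""
    base = 32600 if lat >= 0 else 32700  # northern vs southern hemisphere
    return base + utm_zone(lon)


seattle_zone = utm_zone(-122.3321)
seattle_epsg = utm_epsg(47.6062, -122.3321)
print(seattle_zone, seattle_epsg)
```

For Seattle's longitude this yields zone 10, whereas zone 33 covers longitudes between 12 and 18 degrees east, in central Europe.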