## Best Spot to Open a Cafe in Washington DC

### Introduction / Business Problem

In this project, I combine census tract data for Washington DC, which gives the population and median income of each tract, with Foursquare location data on the number of cafes in each area and their price levels. An entrepreneur who wants to open a cafe would likely want to know which neighborhoods have fewer cafes than expected given their population and median income. A neighborhood with a high population and a high median income but relatively few cafes could be a good place to start looking for a location.

### Data

I use the 2010 Census Tract data for Washington DC, which gives the tract number, total population, and median income for each census tract. The data set also includes the coordinates that outline each tract's border, and I use these to find a central point for each tract (the midpoint of the tract's bounding box, which is not necessarily its true centroid). This point then serves as the center of the search radius when gathering data on nearby cafes through the Foursquare API.

2010 Census Tract Data with border coordinates for each census tract: https://raw.githubusercontent.com/benbalter/dc-maps/master/maps/census-tracts-2010.geojson

The census tract data and the Foursquare data are downloaded and prepared below.

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

!pip install geopy # install geopy if it is not already available
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

!pip install folium # install folium if it is not already available
import folium # map rendering library

print('Libraries imported.')


Collecting folium
  Downloading https://files.pythonhosted.org/packages/fd/a0/ccb3094026649cda4acd55bf2c3822bb8c277eb11446d13d384e5be35257/folium-0.10.1-py2.py3-none-any.whl (91kB)
Collecting branca>=0.3.0 (from folium)
  Downloading https://files.pythonhosted.org/packages/81/6d/31c83485189a2521a75b4130f1fee5364f772a0375f81afff619004e5237/branca-0.4.0-py3-none-any.whl
Installing collected packages: branca, folium
Successfully installed branca-0.4.0 folium-0.10.1
Libraries imported.


In [2]:
!wget -q -O 'dc_data.json' https://raw.githubusercontent.com/benbalter/dc-maps/master/maps/census-tracts-2010.geojson
print('Data downloaded!')

Data downloaded!


In [3]:
with open('dc_data.json') as json_data:
    dc_data = json.load(json_data)

In [4]:
# calculate the center of each census tract as the midpoint of the bounding box of its border coordinates
# method suggested at: https://stackoverflow.com/questions/3081021/how-to-get-the-center-of-a-polygon-in-google-maps-v3
def get_center_of_district(district):
    lats_list = []
    longs_list = []
    for element in district:
        longs_list.append(element[0])
        lats_list.append(element[1])
    lat_max = max(lats_list)
    lat_min = min(lats_list)
    long_max = max(longs_list)
    long_min = min(longs_list)
    central_lat = (lat_max + lat_min) / 2
    central_long = (long_max + long_min) / 2
    return central_long, central_lat
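
The function above returns the midpoint of the bounding box, which can land away from the bulk of an irregularly shaped tract. As a minimal sketch of one alternative, the block below computes the area-weighted centroid of a polygon with the shoelace formula. `polygon_centroid` is a hypothetical helper that is not part of the original notebook, and it treats longitude and latitude as planar coordinates, which is only an approximation (a reasonable one at the scale of a census tract).

```python
def polygon_centroid(ring):
    """Area-weighted centroid of a simple polygon given as [lon, lat] pairs.

    Hypothetical helper, not part of the original notebook. Treats the
    coordinates as planar, which is an approximation for small areas.
    """
    twice_area = 0.0
    cx = 0.0
    cy = 0.0
    n = len(ring)
    for i in range(n):
        x0, y0 = ring[i][0], ring[i][1]
        x1, y1 = ring[(i + 1) % n][0], ring[(i + 1) % n][1]
        cross = x0 * y1 - x1 * y0  # signed area contribution of this edge
        twice_area += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    if twice_area == 0:
        raise ValueError("degenerate polygon with zero area")
    # the centroid formula divides by 6 * area, i.e. 3 * twice_area
    return cx / (3.0 * twice_area), cy / (3.0 * twice_area)


# An L-shaped polygon: the bounding-box midpoint would be (1, 1),
# while the centroid sits closer to the thick corner of the "L".
l_shape = [[0, 0], [2, 0], [2, 1], [1, 1], [1, 2], [0, 2]]
centroid_long, centroid_lat = polygon_centroid(l_shape)
```

Like `get_center_of_district`, the helper returns longitude first, so it could be swapped in without changing the calling code. For a concave tract even the centroid can fall outside the border, so either method still deserves a visual check on the map.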

In [5]:
column_names = ['tract_id', 'geo_id', 'total_pop', 'total_pop_18+', 'median_income', 'long', 'lat']
df_dcdata = pd.DataFrame(columns=column_names)
df_dcdata

Unnamed: 0,tract_id,geo_id,total_pop,total_pop_18+,median_income,long,lat


In [6]:
#populate dataframe with needed data, including the center of each census tract
# collect one row per tract, then build the dataframe in a single step
# (DataFrame.append was removed in pandas 2.0 and is slow inside a loop)
rows = []
for feature in dc_data['features']:
    props = feature['properties']
    lst_coords = feature['geometry']['coordinates'][0]
    mid_longitude, mid_latitude = get_center_of_district(lst_coords)
    rows.append({'tract_id': props['TRACT'],
                 'geo_id': props['GEOID'],
                 'total_pop': props['P0010001'],
                 'total_pop_18+': props['P0030001'],
                 'median_income': props['FAGI_MEDIAN_2010'],
                 'long': mid_longitude,
                 'lat': mid_latitude})
df_dcdata = pd.DataFrame(rows, columns=column_names)

In [7]:
df_dcdata.head()

Unnamed: 0,tract_id,geo_id,total_pop,total_pop_18+,median_income,long,lat
0,1001,11001001001,7436,5918,114136.5,-77.089557,38.949481
1,1002,11001001002,3442,3226,74658.0,-77.079024,38.939686
2,4001,11001004001,3745,3486,72807.0,-77.046452,38.919678
3,4002,11001004002,2797,2654,60460.5,-77.043998,38.918528
4,4100,11001004100,2708,2482,87019.0,-77.052629,38.915475


In [8]:
address = 'Washington, DC'

geolocator = Nominatim(user_agent="dc_explorer")
location = geolocator.geocode(address)
print(type(location))

<class 'geopy.location.Location'>


In [9]:
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Washington DC are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Washington DC are 38.8949855, -77.0365708.


In [10]:
# create map of Washington DC using latitude and longitude values
map_dc = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, tract, med_income in zip(df_dcdata['lat'], df_dcdata['long'], df_dcdata['tract_id'], df_dcdata['median_income']):
    label = '{}, {}'.format(tract, med_income)
    label = folium.Popup(label, parse_html=True)
    # color each marker by income band: below 35,000, 35,000 up to 75,000, and 75,000 or more
    if med_income < 35000:
        marker_color = 'red'
        filling_color = 'lightcoral'
    elif med_income < 75000:
        marker_color = 'green'
        filling_color = 'lightgreen'
    else:
        marker_color = 'blue'
        filling_color = 'lightblue'
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color=marker_color,
        fill=True,
        fill_color=filling_color,
        fill_opacity=0.7).add_to(map_dc)
    
map_dc

In [12]:
# Generate a choropleth map of median income by tract to show the boundaries of each tract,
# using the boundary data included in the geojson file.
# The method for finding the center point of each tract has generally worked well, but some
# points are not placed optimally, so this will need to be considered in the analysis.
folium.Choropleth(
    geo_data=dc_data,
    data=df_dcdata,
    columns=['tract_id', 'median_income'],
    key_on='feature.properties.TRACT',
    fill_color='YlOrRd',
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name='Median income by census tract'
).add_to(map_dc)

# display map
map_dc



#### Define Foursquare Credentials and download data on venues by census tract

In [13]:
CLIENT_ID = 'YFH5AOLVXR0NVCA0D5MPBFKLXZ1ST1X2VH3ER45JIGHIVFZN' # your Foursquare ID
CLIENT_SECRET = 'VEDKRJA3CMXT1D04TWLDRABSXU3YJLWF1G34NKW1I4AEFUA0' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: YFH5AOLVXR0NVCA0D5MPBFKLXZ1ST1X2VH3ER45JIGHIVFZN
CLIENT_SECRET:VEDKRJA3CMXT1D04TWLDRABSXU3YJLWF1G34NKW1I4AEFUA0
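
Hard-coding and printing API credentials in a notebook makes them easy to leak when the notebook is shared. As a minimal sketch of one alternative, the block below reads the credentials from environment variables and masks them before printing. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helpers `load_foursquare_credentials` and `mask` are assumptions chosen for illustration; they are not part of the original notebook or of the Foursquare API.

```python
import os


def load_foursquare_credentials(env=os.environ):
    """Read the Foursquare client ID and secret from environment variables.

    The variable names are hypothetical; use whatever names your setup defines.
    """
    client_id = env.get('FOURSQUARE_CLIENT_ID')
    client_secret = env.get('FOURSQUARE_CLIENT_SECRET')
    if not client_id or not client_secret:
        raise RuntimeError('Foursquare credentials are not set in the environment')
    return client_id, client_secret


def mask(secret, visible=4):
    """Show only the first few characters of a secret when printing it."""
    if len(secret) <= visible:
        return '*' * len(secret)
    return secret[:visible] + '*' * (len(secret) - visible)


# Usage, assuming both variables are set before the notebook starts:
# CLIENT_ID, CLIENT_SECRET = load_foursquare_credentials()
# print('CLIENT_ID: ' + mask(CLIENT_ID))
```

With this approach the notebook can be published without exposing the secret, and a missing variable fails early with a clear message instead of producing a failed API request later.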


#### Define a function to download data on venues within a 500-meter radius of the center of each census tract

In [15]:
LIMIT = 100 # maximum number of venues returned per Foursquare API request

def getNearbyVenues(tract_ids, latitudes, longitudes, radius=500):
    """Return a dataframe of the venues within `radius` meters of each tract center."""
    venues_list = []
    for tract_id, lat, lng in zip(tract_ids, latitudes, longitudes):
        print(tract_id)

        # build the API request; requests handles the URL encoding of the parameters
        url = 'https://api.foursquare.com/v2/venues/explore'
        params = {
            'client_id': CLIENT_ID,
            'client_secret': CLIENT_SECRET,
            'v': VERSION,
            'll': '{},{}'.format(lat, lng),
            'radius': radius,
            'limit': LIMIT,
        }

        # make the GET request
        results = requests.get(url, params=params).json()['response']['groups'][0]['items']

        # keep only the relevant information for each nearby venue
        venues_list.append([(
            tract_id,
            lat,
            lng,
            v['venue']['name'],
            v['venue']['location']['lat'],
            v['venue']['location']['lng'],
            v['venue']['categories'][0]['name']) for v in results])

    # flatten the per-tract lists into a single dataframe
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Tract_id',
                  'Tract Center Lat',
                  'Tract Center Long',
                  'Venue',
                  'Venue Latitude',
                  'Venue Longitude',
                  'Venue Category']

    return nearby_venues
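
The chained lookup `["response"]['groups'][0]['items']` raises an exception as soon as one request returns an unexpected body (for example an error response), which would discard the results already collected for earlier tracts. A minimal sketch of more defensive parsing is shown below. It assumes only the response shape that the function above already relies on; `extract_items` and `venue_row` are hypothetical helpers, not part of the original notebook or of the Foursquare API.

```python
def extract_items(payload):
    """Return the list of venue items from an 'explore' response, or [] if absent."""
    groups = payload.get('response', {}).get('groups', [])
    if not groups:
        return []
    return groups[0].get('items', [])


def venue_row(tract_id, lat, lng, item):
    """Build one output row; the category is None when a venue has no categories."""
    venue = item['venue']
    categories = venue.get('categories') or []
    category = categories[0]['name'] if categories else None
    return (tract_id, lat, lng,
            venue['name'],
            venue['location']['lat'],
            venue['location']['lng'],
            category)


# A hand-written payload in the shape the notebook expects, for illustration only.
example_payload = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Bourbon Coffee',
               'location': {'lat': 38.943671, 'lng': -77.077613},
               'categories': [{'name': 'Coffee Shop'}]}}]}]}}
rows = [venue_row('001002', 38.939686, -77.079024, item)
        for item in extract_items(example_payload)]
```

Inside `getNearbyVenues`, the list comprehension could call these helpers instead of indexing the response directly, so that a tract with a failed request simply contributes no rows.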

In [17]:
dc_venues = getNearbyVenues(tract_ids=df_dcdata['tract_id'],
                                   latitudes=df_dcdata['lat'],
                                   longitudes=df_dcdata['long']
                                  )

001001
001002
004001
004002
004100
004201
004202
004300
004400
001100
001200
001301
001302
001401
001402
000502
003400
003500
003600
003700
003800
003900
006500
001500
001600
001702
000100
000201
000202
000300
000400
000501
007304
007401
007403
007404
007406
007407
007408
007409
007502
004600
004701
004702
004801
004802
004901
004902
005001
005002
005201
005301
005500
005600
005800
005900
006202
006400
006600
006700
006801
006802
006804
006900
007000
007100
000600
000701
000702
000801
000802
000901
001803
001804
001901
000902
003301
003302
008804
008903
008904
009000
009102
009201
009203
009204
007200
007301
002301
002302
002400
002501
002502
002600
002701
007601
007603
007604
007605
007703
007707
007708
007709
007803
007804
007806
002702
002801
002802
002900
003000
003100
003200
009811
009901
009902
009903
009904
009905
009906
009907
010100
010200
010300
010400
007503
007504
001902
002001
002002
002101
002102
002201
002202
009302
009400
009501
009503
009504
009505
009507
009508
009509

In [18]:
print(dc_venues.shape)
dc_venues.head()

(3987, 7)


Unnamed: 0,Tract_id,Tract Center Lat,Tract Center Long,Venue,Venue Latitude,Venue Longitude,Venue Category
0,1002,38.939686,-77.079024,Red Hook Lobster Pound DC,38.939513,-77.078287,Food Truck
1,1002,38.939686,-77.079024,The Spa Room,38.942982,-77.076708,Massage Studio
2,1002,38.939686,-77.079024,Sullivan's Toy Store,38.943748,-77.077712,Toy / Game Store
3,1002,38.939686,-77.079024,Feelin' Crabby,38.939441,-77.07523,Food Truck
4,1002,38.939686,-77.079024,Bourbon Coffee,38.943671,-77.077613,Coffee Shop
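
Since the goal is to compare the number of cafes across tracts, a natural next step is to count the cafe-type venues in each tract. The block below is a minimal sketch of that step, run on the five rows shown above rather than on the full `dc_venues` dataframe. Which Foursquare categories count as a cafe is an assumption (`CAFE_CATEGORIES` is a hypothetical choice), and `count_cafes_by_tract` is a hypothetical helper that is not part of the original notebook.

```python
import pandas as pd

# Assumption: the venue categories treated as "cafes" in this analysis.
CAFE_CATEGORIES = ['Coffee Shop', 'Café']

# The five rows displayed by dc_venues.head(), with the tract id as a string.
sample_venues = pd.DataFrame({
    'Tract_id': ['001002'] * 5,
    'Venue': ['Red Hook Lobster Pound DC', 'The Spa Room', "Sullivan's Toy Store",
              "Feelin' Crabby", 'Bourbon Coffee'],
    'Venue Category': ['Food Truck', 'Massage Studio', 'Toy / Game Store',
                       'Food Truck', 'Coffee Shop'],
})


def count_cafes_by_tract(venues, categories=CAFE_CATEGORIES):
    """Count the venues per tract whose category is in `categories`."""
    is_cafe = venues['Venue Category'].isin(categories)
    return venues[is_cafe].groupby('Tract_id').size().rename('cafe_count')


cafe_counts = count_cafes_by_tract(sample_venues)
```

Tracts without any matching venue do not appear in the result, so when joining the counts back onto `df_dcdata` the missing values would need to be filled with zero.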
