# Coursera Data Science Capstone
### Battle of the Neighborhoods - Final Project

## Introduction


In this final project, we will explore the optimal location for adding four food service stores in Atlanta, GA, USA. These are stores designed to be suppliers for area restaurants.

We want these four stores to be **in areas with dense clusters of restaurants** to maximize the potential customer base. These stores should also be **evenly spread throughout the city** to maximize the footprint throughout the city. Lastly, according to the principles of game theory and the Nash equilibrium, we will want to consider **close proximity to existing food service stores** of competitors.

We will using data science methodology to find 4 suitible locations to open food service stores.

In [1]:
import pandas as pd
import numpy as np

import json # library to handle JSON files

!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # tranform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes 
import folium # map rendering library

print('Libraries imported.')

usage: conda-script.py [-h] [-V] command ...
conda-script.py: error: unrecognized arguments: # uncomment this line if you haven't completed the Foursquare API lab


Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

# All requested packages already installed.

Libraries imported.


First we need to get regional data for Atlanta, GA.Since our service area will include surrounding metro Atlanta, we will get a list of surrounding areas. Atlanta has 29 counties that is considered Metro Atlanta so we will **pull a list of the counties and population data** from the "Atlanta Metropolitan Area wikipedia page.

In [2]:
atl_wiki = pd.read_html('https://en.wikipedia.org/wiki/Atlanta_metropolitan_area')

In [3]:
atl_df = atl_wiki[3]
atl_df.head()

Unnamed: 0,County,Seat,2019 Estimate,2010 Census,Change,Area,Density
0,Fulton *,Atlanta,1063937,920581,+15.57%,"534 sq mi (1,380 km2)",2)
1,Gwinnett *,Lawrenceville,936250,805321,+16.26%,"437 sq mi (1,130 km2)",2)
2,Cobb *,Marietta,760141,688078,+10.47%,345 sq mi (890 km2),2)
3,DeKalb *,Decatur,759297,691893,+9.74%,271 sq mi (700 km2),2)
4,Clayton *,Jonesboro,292256,259424,+12.66%,144 sq mi (370 km2),2)


Now that we have our data, we will need to cleanse our data into a usable format. First, we will drop some of the columns that we will not need for our analysis. Next, we will drop the "total" row. Finally, we will drop the astrisk from the County field.

In [4]:
atl_df = atl_df.drop(['Seat', '2010 Census','Change','Area','Density'], axis=1)
atl_df.head()

Unnamed: 0,County,2019 Estimate
0,Fulton *,1063937
1,Gwinnett *,936250
2,Cobb *,760141
3,DeKalb *,759297
4,Clayton *,292256


In [5]:
atl_df = atl_df.drop([29], axis=0)
atl_df.head()

Unnamed: 0,County,2019 Estimate
0,Fulton *,1063937
1,Gwinnett *,936250
2,Cobb *,760141
3,DeKalb *,759297
4,Clayton *,292256


In [6]:
atl_df['County'] = atl_df['County'].map(lambda x: x.rstrip('*'))
atl_df.head()

Unnamed: 0,County,2019 Estimate
0,Fulton,1063937
1,Gwinnett,936250
2,Cobb,760141
3,DeKalb,759297
4,Clayton,292256


Now that we have our Wikipedia table data cleaned up, let's extract a list of the counties so we can use the list to find coordinates of each.

In [7]:
counties = atl_df['County'].tolist()
counties

['Fulton ',
 'Gwinnett ',
 'Cobb ',
 'DeKalb ',
 'Clayton ',
 'Cherokee ',
 'Forsyth ',
 'Henry ',
 'Paulding ',
 'Coweta ',
 'Douglas ',
 'Carroll',
 'Fayette ',
 'Newton',
 'Bartow ',
 'Walton',
 'Rockdale ',
 'Barrow',
 'Spalding',
 'Pickens',
 'Haralson',
 'Dawson',
 'Butts',
 'Meriwether',
 'Lamar',
 'Morgan',
 'Pike',
 'Jasper',
 'Heard']

Now that we have our list, let's create a dataframe and initate the dataframe column names.

In [8]:
# define the dataframe columns
column_names = ['County', 'Latitude', 'Longitude'] 

# instantiate the dataframe
county_loc_data = pd.DataFrame(columns=column_names)

In [9]:
    address = 'Atlanta, GA'

    geolocator = Nominatim(user_agent="ny_explorer")
    location = geolocator.geocode(address)
    latitude = location.latitude
    longitude = location.longitude
    
print('Latitude =',latitude)
print('Longitude =', longitude)

Latitude = 33.081484
Longitude = -84.3969481


We will loop through each county in the list and populate our dataframe with county and coordinates data. Due to Nominatim's user agreement, we will need to pause for 1 sec during each loop.

In [10]:
from time import sleep

for county in counties:
    
    # geocode doesn't recognize Pike, GA so we have to rewrite it to Pike County, GA to get correct coordinates
    if county == 'Pike':
        address = county + ' County, GA'
        
    elif county == 'Walton':
        address = county + ' County, GA'
        
    else:
        address = county + ', GA'

    geolocator = Nominatim(user_agent="ny_explorer")
    location = geolocator.geocode(address)
    latitude = location.latitude
    longitude = location.longitude

    county_loc_data = county_loc_data.append({'County': county,
                                          'Latitude': latitude,
                                          'Longitude': longitude}, ignore_index=True)
    sleep(1)

In [11]:
county_loc_data

Unnamed: 0,County,Latitude,Longitude
0,Fulton,34.058039,-84.296128
1,Gwinnett,33.956687,-84.022747
2,Cobb,33.937395,-84.573201
3,DeKalb,33.757561,-84.218651
4,Clayton,33.520496,-84.359171
5,Cherokee,34.239126,-84.472027
6,Forsyth,34.235309,-84.133564
7,Henry,33.451994,-84.136777
8,Paulding,33.890853,-84.856916
9,Coweta,33.350066,-84.754573


Finally, we will join our two datafames together into a single dataframe containing our location data. 

In [12]:
atl_df = pd.merge(atl_df,county_loc_data, on='County', how='left')
atl_df.head()

Unnamed: 0,County,2019 Estimate,Latitude,Longitude
0,Fulton,1063937,34.058039,-84.296128
1,Gwinnett,936250,33.956687,-84.022747
2,Cobb,760141,33.937395,-84.573201
3,DeKalb,759297,33.757561,-84.218651
4,Clayton,292256,33.520496,-84.359171


In [13]:
# We will find the avg for Lat and Lon in order to center our map
latitude = atl_df['Latitude'].mean(axis=0)
longitude = atl_df['Longitude'].mean(axis=0)
print('Latitude =',latitude)
print('Longitude =', longitude)

Latitude = 33.68305064482758
Longitude = -84.33993787931033


In [14]:
# create map of Atlanta using latitude and longitude values
map_atl = folium.Map(location=[latitude, longitude], zoom_start=8)

# add markers to map
for lat, lng, county in zip(atl_df['Latitude'], atl_df['Longitude'], atl_df['County']):
    label = '{}'.format(county)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=20,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_atl)  
    
map_atl

Define Foursquare Credentials and Version

In [15]:
CLIENT_ID = 'BXUCPOEUBAZMUJNQU4Q1D1CD2K5BL4BLFSDMLDUQRMW0EVFM' # your Foursquare ID
CLIENT_SECRET = '4JGHXK0PMSZBUWDZIYC0NKJA1GA3W2MSYDO2QJG1LSY3KILY' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value

In [16]:
food_category = '4f2a25ac4b909258e854f55f' # 'Root' category for all food-related venues



In [17]:
def getNearbyVenues(names, latitudes, longitudes, radius=10000):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL

        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&categoryId={}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            food_category,
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['County', 
                  'County Latitude', 
                  'County Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

In [18]:
atl_venues = getNearbyVenues(names=atl_df['County'], latitudes=atl_df['Latitude'], longitudes=atl_df['Longitude'])

Fulton 
Gwinnett 
Cobb 
DeKalb 
Clayton 
Cherokee 
Forsyth 
Henry 
Paulding 
Coweta 
Douglas 
Carroll
Fayette 
Newton
Bartow 
Walton
Rockdale 
Barrow
Spalding
Pickens
Haralson
Dawson
Butts
Meriwether
Lamar
Morgan
Pike
Jasper
Heard


In [19]:
print(atl_venues.shape)
atl_venues.head()

(1742, 7)


Unnamed: 0,County,County Latitude,County Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Fulton,34.058039,-84.296128,Crust Pasta & Pizzeria,34.071709,-84.296617,Italian Restaurant
1,Fulton,34.058039,-84.296128,Collet French Pastry & Cafe,34.070248,-84.293311,French Restaurant
2,Fulton,34.058039,-84.296128,California Pizza Kitchen,34.04717,-84.291365,Pizza Place
3,Fulton,34.058039,-84.296128,Chick-fil-A,34.069276,-84.280967,Fast Food Restaurant
4,Fulton,34.058039,-84.296128,Rising Roll,34.057895,-84.286683,Sandwich Place


In [20]:
atl_venues.groupby('County').count()

Unnamed: 0_level_0,County Latitude,County Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
County,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Barrow,67,67,67,67,67,67
Bartow,68,68,68,68,68,68
Butts,24,24,24,24,24,24
Carroll,56,56,56,56,56,56
Cherokee,100,100,100,100,100,100
Clayton,100,100,100,100,100,100
Cobb,100,100,100,100,100,100
Coweta,100,100,100,100,100,100
Dawson,20,20,20,20,20,20
DeKalb,100,100,100,100,100,100


We'll use the Foursquare category ID for any food related venue (4d4b7105d754a06374d81259). Also, we will use Kitchen Supply Store category (58daa1558bbb0b01f18ec1b4) to find competitors. (https://developer.foursquare.com/docs/build-with-foursquare/categories/)