# Applied Data Science

## The Battle of Neighborhoods

This notebook contains a description of the work performed in fulfillment
of the Final Project: "The Battle of Neighborhoods" in the 
"Applied Data Science Capstone", the final course
in the IBM-sponsored "Applied Data Science" Coursera course series.


## Introduction

I recently moved my family from the verdant hills of the Blue Ridge Mountains
in southwestern Virginia to the arid high desert of Sierra Vista, in
southeastern Arizona. Because I had lived in the Phoenix metro area
of Arizona in the past, I was expecting Sierra Vista to be similar to
the "Valley of the Sun". I was pleasantly surprised to find that the
climate and countryside of Sierra Vista were nothing like what I had experienced
in Phoenix. First, Sierra Vista is a full 15 degrees cooler on most days.
This means that even when Phoenix is experiencing triple-digit heat, it is usually
quite pleasant in Sierra Vista, even during the hottest part of the day. This is
partly explained by the difference in elevation: Phoenix is approximately
1,000 feet above sea level, while Sierra Vista is over 4,600 feet above
sea level (and Miller Peak, which is just outside Sierra Vista, reaches 9,600
feet). This difference in altitude also explains some of the differences
in flora and fauna. Sierra Vista lacks the iconic "Saguaro Cactus" for which
Arizona is so famous, and I never ran into any bears in Phoenix (except at the
Phoenix Zoo). These, and a number of other differences, made me want to use
my Capstone project to investigate some aspect of Sierra Vista.

Although Sierra Vista is a modest-sized city by most measures, with around 44,000 inhabitants, it is unusually cosmopolitan for its size. This is primarily because of the influence of the Fort Huachuca military reservation, which is the largest employer in the area, and partly because of an unusually high concentration of high-tech industries and highly educated workers in the area (which may be correlated with the presence of the base).


## Problem Statement

Determine the best location for a new restaurant in Sierra Vista. While there are a number of factors that can influence the success or failure of a new eating establishment, 
this study will focus on population base and geographic distribution of existing restaurants relative to population centers.


## Methodology

The methodology is as follows. First, we get the geospatial coordinates for the three postal codes covering the city of Sierra Vista. We then get the list of existing restaurants from Foursquare and perform K-Means clustering on them to establish their location relative to the three postal codes. Finally, we look to see which postal codes are underrepresented in the set of existing restaurants.

## Data

In [65]:
!conda install -c conda-forge folium
!conda install -c conda-forge geocoder
import numpy as np
import pandas as pd
import sklearn
import sklearn.cluster
import time
import folium
import requests
import bs4
import urllib.request
import geocoder

Solving environment: done

# All requested packages already installed.

Solving environment: done

# All requested packages already installed.



In [60]:
# get latitude and longitude for the three postcodes
# in Sierra Vista, Arizona
# hardwired coordinates, used for any postcode the geocoder cannot resolve
fallback_coords = {
    85613: [31.554610, -110.352470],
    85635: [31.549179, -110.262192],
    85650: [31.493020, -110.252360],
}
lat_lng_dict = {}
for postal_code in [85613, 85635, 85650]:
    # retry up to 10 times until the geocoder returns coordinates
    lat_lng_coords = None
    tries = 10
    while lat_lng_coords is None and tries > 0:
        g = geocoder.google('{}, Sierra Vista, Arizona'.format(postal_code))
        lat_lng_coords = g.latlng
        time.sleep(0.005)
        tries = tries - 1
    # if the geocoder call failed, then use the hardwired coordinates
    if lat_lng_coords is None:
        lat_lng_coords = fallback_coords[postal_code]
    lat_lng_dict[postal_code] = lat_lng_coords
# the clustering below uses the hardwired coordinates as the postcode centers
llarray = np.array([[31.554610, -110.352470],[31.549179, -110.262192],[31.493020, -110.252360]])
llarray

array([[  31.55461 , -110.35247 ],
       [  31.549179, -110.262192],
       [  31.49302 , -110.25236 ]])

In [3]:
# scrape a website for population information
svpopURL = u'https://suburbanstats.org/population/arizona/how-many-people-live-in-sierra-vista'
fp = urllib.request.urlopen(svpopURL)
mybytes = fp.read()
mystr = mybytes.decode("utf8")
fp.close()
print("Read Website:",svpopURL)

Read Website: https://suburbanstats.org/population/arizona/how-many-people-live-in-sierra-vista


In [19]:
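#svpopdf = pd.read_html(svpopURL)
# match <td> cells whose first CSS class marks them as statistics cells
def getTDStatsGrid(x):
    return x.name == 'td' and x.has_attr('class') and x['class'][0] in ('statsGrid', 'statsGridOdd')
# name the parser explicitly so the result does not depend on what is installed
svhtml = bs4.BeautifulSoup(mystr, 'html.parser')
svtables = svhtml.body.find_all('table')
for svtab in svtables:
    #print(svtab)
    print(svtab.h3)
    svtd = svtab.find_all('td')
    for dat in svtd:
        # the statistics cells are nested inside outer layout cells
        svtd2 = dat.find_all(getTDStatsGrid)
        for dat2 in svtd2:
            print("-->",dat2.string)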

None
<h3><a href="/population/arizona/list-of-counties-and-cities-in-arizona">Other Counties and Cities in Arizona</a></h3>
--> Race
--> Population
--> % of Total
--> Total Population
--> 43,888
--> 100
--> White
--> 32,695
--> 74
--> Hispanic or Latino
--> 8,527
--> 19
--> Black or African American
--> 3,951
--> 9
--> Two or More Races
--> 2,515
--> 5
--> Some Other Race
--> 2,210
--> 5
--> Asian
--> 1,781
--> 4
--> American Indian
--> 467
--> 1
--> Native Hawaiian Pacific Islander
--> 269
--> Below 1%
--> Three or more races
--> 254
--> Below 1%
--> Male
--> Female
--> Total
--> Total Population
--> 22,334
--> 21,554
--> 43,888
--> White
--> 16,824
--> 15,871
--> 32,695
--> Hispanic or Latino
--> 4,127
--> 4,400
--> 8,527
--> Black or African American
--> 2,166
--> 1,785
--> 3,951
--> Two or More Races
--> 1,275
--> 1,240
--> 2,515
--> Some Other Race
--> 1,043
--> 1,167
--> 2,210
--> Asian
--> 648
--> 1,133
--> 1,781
--> American Indian
--> 251
--> 216
--> 467
--> Native Hawaiian Pa
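
The scraped cells above arrive as a flat stream of strings. As a minimal sketch, assuming the first table always comes in groups of three cells (race, population, percent of total), the values could be turned into numbers as follows. The `cells` list is copied by hand from the output above, and the helper name `parse_population_rows` is hypothetical.

```python
# Cell strings copied from the first table in the scraped output above;
# each row is (race, population, percent of total).
cells = [
    "Total Population", "43,888", "100",
    "White", "32,695", "74",
    "Hispanic or Latino", "8,527", "19",
    "Black or African American", "3,951", "9",
    "Two or More Races", "2,515", "5",
    "Some Other Race", "2,210", "5",
    "Asian", "1,781", "4",
    "American Indian", "467", "1",
    "Native Hawaiian Pacific Islander", "269", "Below 1%",
    "Three or more races", "254", "Below 1%",
]

def parse_population_rows(cells, width=3):
    """Group a flat list of cell strings into (label, count) pairs.

    Assumes each row has `width` cells, with the label first and the
    population count (with thousands separators) second.
    """
    rows = []
    for i in range(0, len(cells) - width + 1, width):
        label, count = cells[i], cells[i + 1]
        rows.append((label, int(count.replace(",", ""))))
    return rows

population_by_race = dict(parse_population_rows(cells))
total = population_by_race["Total Population"]

# Sanity check against the male/female totals in the second table above
male, female = 22334, 21554
print(total, male + female == total)
```

Note that the race categories overlap (the listed counts sum to more than the total population), so they should not be added together.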

In [51]:
svlat = 31.5455
svlon = -110.2773
svmap = folium.Map(location=[svlat,svlon], zoom_start=14)


In [21]:
# The code was removed by Watson Studio for sharing.

In [None]:
# Get popular venues in Sierra Vista from Foursquare
version = '20190701'
radius = '15000'
limit = '200'
# Build the request URL for the Foursquare venues/explore endpoint
url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    version, 
    svlat, 
    svlon, 
    radius, 
    limit)
results = requests.get(url).json()
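
The request URL above is assembled by string formatting. An alternative sketch, using only the standard library, builds the same query string from a dictionary so that each parameter is named and escaped. The function name `build_explore_url` and the placeholder credentials are hypothetical; the real `CLIENT_ID` and `CLIENT_SECRET` were removed from this notebook for sharing.

```python
from urllib.parse import urlencode

def build_explore_url(client_id, client_secret, lat, lon,
                      version="20190701", radius=15000, limit=200):
    """Build the venues/explore URL from a dictionary of parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lon),
        "radius": radius,
        "limit": limit,
    }
    # safe="," keeps the comma in the ll parameter unescaped and readable
    query = urlencode(params, safe=",")
    return "https://api.foursquare.com/v2/venues/explore?" + query

# Placeholder credentials (hypothetical), with the map center used in this notebook
url = build_explore_url("MY_CLIENT_ID", "MY_CLIENT_SECRET", 31.5455, -110.2773)
print(url)
```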


In [44]:
# Process Foursquare results into a usable data frame
# for displaying all of the restaurant venues around
# Sierra Vista on the Folium map we are creating
venues = results['response']['groups'][0]['items']
# pandas >= 1.0; older versions use pd.io.json.json_normalize
venues = pd.json_normalize(venues)
# filter columns
venues = venues.loc[:, ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']]
# keep only the name of the first category listed for each venue
def extract_cat_name(categories):
    return categories[0]['name'].replace("'", "")
# strip apostrophes, ampersands and periods from venue names
def replace_name_quotes(name):
    return name.replace("'", "").replace("&", "").replace(".", "")
# remove apostrophes from names and categories to prevent problems with JavaScript
venues['venue.categories'] = venues['venue.categories'].apply(extract_cat_name)
venues['venue.name'] = venues['venue.name'].apply(replace_name_quotes)
# keep only venues whose category looks like an eating establishment
rest_mask = venues['venue.categories'].str.contains('Restaurant|Joint|Place|Diner|house', regex=True)
venues = venues[rest_mask].reset_index(drop=True)
venues

| | venue.name | venue.categories | venue.location.lat | venue.location.lng |
|---|---|---|---|---|
| 0 | Culvers | Fast Food Restaurant | 31.549338 | -110.25794 |
| 1 | Tanuki Sushi Bar Garden | Sushi Restaurant | 31.555063 | -110.285828 |
| 2 | SONIC Drive In | Fast Food Restaurant | 31.554105 | -110.260642 |
| 3 | Hana Tokyo Hibachi Sushi Bar | Sushi Restaurant | 31.538875 | -110.256726 |
| 4 | Fresh | Salad Place | 31.528866 | -110.259976 |
| 5 | SUBWAY | Sandwich Place | 31.559322 | -110.256768 |
| 6 | Native Grill Wings | Wings Joint | 31.558144 | -110.257872 |
| 7 | Mod Pizza | Pizza Place | 31.555179 | -110.254598 |
| 8 | Vinnys New York Pizza | Wings Joint | 31.534207 | -110.256246 |
| 9 | Arbys | Fast Food Restaurant | 31.555063 | -110.278035 |

In [82]:
# cluster restaurants, seeding the three K-Means centroids
# with the centers of the three postal codes
my_centers = llarray
kmeans = sklearn.cluster.KMeans(n_clusters=3, init=my_centers, n_init=1, max_iter=300, tol=0.0001)
points = venues[['venue.location.lat', 'venue.location.lng']]
venues['venue.cluster'] = kmeans.fit_predict(points)
venues

| | venue.name | venue.categories | venue.location.lat | venue.location.lng | venue.cluster |
|---|---|---|---|---|---|
| 0 | Culvers | Fast Food Restaurant | 31.549338 | -110.25794 | 1 |
| 1 | Tanuki Sushi Bar Garden | Sushi Restaurant | 31.555063 | -110.285828 | 1 |
| 2 | SONIC Drive In | Fast Food Restaurant | 31.554105 | -110.260642 | 1 |
| 3 | Hana Tokyo Hibachi Sushi Bar | Sushi Restaurant | 31.538875 | -110.256726 | 1 |
| 4 | Fresh | Salad Place | 31.528866 | -110.259976 | 1 |
| 5 | SUBWAY | Sandwich Place | 31.559322 | -110.256768 | 1 |
| 6 | Native Grill Wings | Wings Joint | 31.558144 | -110.257872 | 1 |
| 7 | Mod Pizza | Pizza Place | 31.555179 | -110.254598 | 1 |
| 8 | Vinnys New York Pizza | Wings Joint | 31.534207 | -110.256246 | 1 |
| 9 | Arbys | Fast Food Restaurant | 31.555063 | -110.278035 | 1 |

In [83]:
# add venues to map
for lat, lng, name, cat, clust in zip(venues['venue.location.lat'], venues['venue.location.lng'], venues['venue.name'], venues['venue.categories'], venues['venue.cluster']):
    print(cat,':',name,'(', lat, ',', lng, ')', clust)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        color='green',
        popup=folium.Popup("'{}' '{}'".format(name,cat))
    ).add_to(svmap)


Fast Food Restaurant : Culvers ( 31.54933786556276 , -110.25793990285811 ) 1
Sushi Restaurant : Tanuki Sushi Bar  Garden ( 31.555063257078093 , -110.28582775922744 ) 1
Fast Food Restaurant : SONIC Drive In ( 31.5541052 , -110.2606416 ) 1
Sushi Restaurant : Hana Tokyo Hibachi Sushi  Bar ( 31.538874885077032 , -110.25672640154195 ) 1
Salad Place : Fresh ( 31.528866209911275 , -110.25997612731737 ) 1
Sandwich Place : SUBWAY ( 31.55932237734676 , -110.25676787533794 ) 1
Wings Joint : Native Grill  Wings ( 31.558143831584268 , -110.25787168352605 ) 1
Pizza Place : Mod Pizza ( 31.5551794 , -110.2545976 ) 1
Wings Joint : Vinnys New York Pizza ( 31.534207057343984 , -110.2562461037307 ) 1
Fast Food Restaurant : Arbys ( 31.5550628477915 , -110.278035302808 ) 1
Steakhouse : Texas Roadhouse ( 31.530422 , -110.257834 ) 1
Restaurant : Schlotzskys ( 31.554106284443254 , -110.25863350836838 ) 1
Asian Restaurant : Hibachi Grill Super Buffet ( 31.559618753694238 , -110.25578415748461 ) 1
Fast Food Rest

In [84]:
svmap


## Results

As can be seen in the preceding section, the population of Sierra Vista is fairly evenly distributed among the three postal codes, while the existing restaurants are located almost exclusively in just one of them (cluster 1, the 85635 postal code). When we ran the K-Means clustering algorithm, we initialized the starting centroids to the latitude and longitude of the center of each postal code, allowing the clustering to separate the restaurants into three clusters based on their proximity to each other and to the center of the postal code in which they reside. We used the geographical center of Sierra Vista as the center point for a Foursquare query to return the 200 most popular venues. Of these 200 venues, 26 turned out to be restaurants, but these restaurants are not evenly dispersed across postal codes: of the 26 identified restaurants, 21 are located in the 85635 postal code. One reason for this is that the 85613 postal code (cluster 0) is geographically limited to Fort Huachuca, a federal U.S. Army post. Since the post strictly controls the eating establishments located on federal property, this naturally limits the number of restaurants that will be found there.

The big mystery is why there are only two restaurants in the 85650 postal code (cluster 2). While answering this question is beyond the scope of the current study, one can conjecture that this is probably due to the way Sierra Vista developed over time: first as a relatively small community supporting the post (concentrated primarily around its main gate), then gradually growing east along the major highway, before branching out and growing towards the southeast. This has resulted in the vast majority of commercial activity being located in the northwest part of the city, near the post. The greatest concentration of existing restaurants is found co-located with other commercial property along the major east-west corridor (Fry Blvd), and then, secondarily, along the major north-south corridor (State Route 90), in particular where the two intersect. However, housing starts have continued to push southeast, with most of the new construction occurring well to the south of this major intersection and well away from these existing commercial areas. This suggests an opportunity to capitalize on the convenience factor for this growing population center by locating a new restaurant in the commercially zoned areas of postal code 85650.

## Discussion

This study is obviously flawed in several ways. First, the three postal codes do not provide enough granularity in the distribution of population versus restaurant location to support an informed and specific recommendation for the location of a new restaurant. More specific neighborhood data would have been much better. I attempted to get data for the census blocks of Sierra Vista, which correspond much more closely to the concept of a "neighborhood". Although such data exists and is supposedly made available by the federal Census Bureau, I was unable to obtain a "key" to access it. For this reason, I conducted my analysis at the level of the postal codes. Second, population data alone is not enough on which to base a decision about the location of a new restaurant. Other factors that should be considered include accessibility, visibility, and availability of parking. While visibility may be a subjective measure, it should be possible to collect and analyze data on both accessibility and parking. One could envision the creation of a weighted formula that takes these factors into consideration and outputs a ranking of candidate locations.
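
The weighted formula mentioned above could be sketched as a simple weighted sum. Everything in the example below is an assumption made for illustration: the weights, the candidate site names, and the factor scores (each on a 0 to 1 scale) are hypothetical and do not come from the data collected in this study.

```python
# Hypothetical weights for each factor; they sum to 1.0
weights = {
    "population": 0.4,
    "accessibility": 0.3,
    "parking": 0.2,
    "visibility": 0.1,
}

# Hypothetical candidate sites with factor scores normalized to [0, 1]
candidates = {
    "Site A": {"population": 0.9, "accessibility": 0.5, "parking": 0.7, "visibility": 0.4},
    "Site B": {"population": 0.6, "accessibility": 0.9, "parking": 0.8, "visibility": 0.7},
    "Site C": {"population": 0.3, "accessibility": 0.6, "parking": 0.9, "visibility": 0.9},
}

def weighted_score(factors, weights):
    """Weighted sum of the factor scores for one candidate site."""
    return sum(weights[name] * factors[name] for name in weights)

# Rank the candidate sites from highest to lowest score
ranking = sorted(
    candidates,
    key=lambda site: weighted_score(candidates[site], weights),
    reverse=True,
)
for site in ranking:
    print(site, round(weighted_score(candidates[site], weights), 3))
```

In practice, the choice of weights would itself need justification, for example by comparing the scores of existing restaurants against some measure of their success.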

As for the experience of creating this analysis, I found the IBM cloud environment difficult to work in, particularly the web-based Watson Studio development environment in conjunction with the Jupyter Notebook facility. First, the environment is sensitive to network latency. Network latency (and associated networking issues) made it necessary for me to reload the notebook periodically. Each time the notebook was reloaded, the Python modules upon which my analysis depended had to be re-evaluated, resulting in a 5-minute interruption to my workflow. Second, the Jupyter Notebook environment is currently having issues with GitHub. While the IBM Watson Studio environment supports GitHub integration, once a version of the notebook was pushed to GitHub, it became inaccessible. Finally, the Folium map facility was inconsistent when viewed within the Watson Studio/Jupyter Notebook environment. At times the map would display as expected within the notebook, but at other times it would refuse to display. I have been unable to ascertain which factors influence whether the map displays. All of these factors negatively impact productivity, and increase anxiety and uncertainty while attempting to develop within a new technology stack.

## Conclusions

This study has used geospatial information to analyze existing population versus restaurant distribution and has made a recommendation for the location of a new restaurant based upon this data. The recommended location for the new restaurant is in the 85650 postal code, where the population is growing and where restaurants are under-represented. More granular data would need to be obtained and analyzed to make a more specific recommendation.