# Impact of Neighborhood on Amazon HQ2 selection

### *Will a good neighborhood attract new offices from unicorn companies?*


## Introduction

In 2017 Amazon announced a search for its second headquarters (HQ2). The announcement set off a wave of bidding from 54 states, provinces, districts, and territories, drawn by the large financial and employment potential. More than 200 cities entered the bidding, and 20 of them made the final list. In November 2018, Long Island City, New York and Crystal City, Virginia were selected as the HQ2 locations. After objections from various political groups, Amazon canceled the Long Island City selection in early 2019, while development at Crystal City, VA is still under way.

Amazon laid down requirements for the HQ2 selection, such as a metropolitan area with a certain population, proximity to a population center and to highways and an airport, availability of talent, and financial incentives. It is interesting to check whether the neighborhood of a candidate location is also an important criterion. For example, do the HQ2 sites have neighborhoods similar to those around the current headquarters in Seattle, WA? If neighborhood similarity plays a significant role in the HQ2 selection, that information could help city and territory authorities plan a strategy for attracting new businesses in the future.

## Table of Contents
- Download and Explore Dataset

- Explore Neighborhoods in Long Island City, NY, Crystal City, VA, and Seattle, WA

- Preparation of Neighborhood venues data for clustering 

- Cluster Neighborhoods

- Compare the similarity of Neighborhoods in three locations



## Methodology
   
The neighborhood lists for the three cities, Seattle, WA, Arlington, VA, and Queens, NY, can be obtained from open data sources or a wiki website. The latitude and longitude of each neighborhood are then obtained with the geopy package. After all datasets are consolidated, up to 100 popular venues near each neighborhood are retrieved with the Foursquare API. The k-means algorithm is then used to cluster all neighborhoods. Finally, the cluster assignments are compared across the three cities to check how many of their neighborhoods are similar. This leads to a conclusion about the impact of neighborhood similarity on the Amazon HQ2 selection.
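
The final comparison step can be sketched as follows. This is a minimal illustration under stated assumptions, not the analysis itself: the `Cluster` column and its values are made-up placeholders for the k-means output, and `cluster_share_by_city` is a hypothetical helper name.

```python
import pandas as pd


def cluster_share_by_city(df, city_col="City", cluster_col="Cluster"):
    """Return the fraction of each city's neighborhoods that falls in each cluster."""
    # normalize="index" makes every row (one row per city) sum to 1
    return pd.crosstab(df[city_col], df[cluster_col], normalize="index")


# Toy data: the cluster labels below are invented for illustration only
toy = pd.DataFrame({
    "Neighborhood": ["A", "B", "C", "D", "E", "F"],
    "City": ["Seattle", "Seattle", "Arlington", "Arlington", "Queens", "Queens"],
    "Cluster": [0, 1, 0, 0, 1, 1],
})
shares = cluster_share_by_city(toy)
print(shares)
```

Cities whose rows in this table look alike have a similar mix of neighborhood types, which is the notion of similarity the methodology relies on.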
   
   
   
## Download and Explore Dataset 

Before we get the data and start exploring it, let's download all the dependencies that we will need.

In [1]:
import numpy as np 

import pandas as pd 
import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (pandas >= 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes 
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


### Download the neighborhood list from the [Seattle government website](https://www.seattle.gov/neighborhoods/neighborhoods-and-districts), save it as Seattle_Neighborhood.csv, and load the file into a dataframe

In [2]:
df_seattle= pd.read_csv('Seattle_Neighborhood.csv')
df_seattle.head()

| | Neighborhood | City |
|---|---|---|
| 0 | 23rd & Union/Jackson | Seattle |
| 1 | Admiral | Seattle |
| 2 | Aurora-Licton Springs | Seattle |
| 3 | Ballard | Seattle |
| 4 | Beacon Hill | Seattle |


In [3]:
# define a function to get latitude and longitude based on an address


def is_empty(any_structure):
    if any_structure:
        return False
    else:
        return True   

    
def get_geospatial_data(Neighborhood, city, state):
    address = Neighborhood + ',' + city +','+ state 
    geolocator = Nominatim(user_agent="ny_explorer")
    location = geolocator.geocode(address)
    if is_empty(location):
        return ['NaN', 'NaN']
    else: 
        return [location.latitude, location.longitude] 
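
The function above returns the string `'NaN'` when no location is found, which turns the coordinate columns into mixed text and numbers. A possible alternative, sketched here under assumptions, returns a float NaN instead so that pandas tools such as `dropna` can be used later. The names `safe_geocode`, `FakeLocation`, and `fake_geocode` are hypothetical; the fake geocoder only stands in for `Nominatim(...).geocode` so the sketch runs without network access.

```python
import math


def safe_geocode(geocode, address):
    """Return (latitude, longitude) for an address, or (nan, nan) if nothing is found.

    `geocode` is any callable that behaves like Nominatim(...).geocode:
    it returns an object with .latitude and .longitude, or None.
    """
    location = geocode(address)
    if location is None:
        return (math.nan, math.nan)
    return (location.latitude, location.longitude)


# Stand-in geocoder (hypothetical) that knows a single address
class FakeLocation:
    def __init__(self, latitude, longitude):
        self.latitude = latitude
        self.longitude = longitude


def fake_geocode(address):
    known = {"Ballard,Seattle,WA": FakeLocation(47.676507, -122.386223)}
    return known.get(address)


found = safe_geocode(fake_geocode, "Ballard,Seattle,WA")
missing = safe_geocode(fake_geocode, "Nowhere,Seattle,WA")
print(found, missing)
```

With this variant, the later filter on `'NaN'` strings would become a `dropna` call on the coordinate columns.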
   


In [4]:
# obtain the latitude and longitude data and add them to the dataframe
SA_latitude =[]
SA_longitude=[]


for i in df_seattle['Neighborhood']:
    i=i.split('/')[0] 
    i=i.split('-')[0]
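    # str.strip() removes a set of characters, not a substring, so use replace() instead
    if 'Junction' in i:
        i=i.replace('Junction', '').strip()
    if 'Commercial Core' in i:
        i=i.replace('Commercial Core', '').strip()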
    geo_info = get_geospatial_data(i, 'Seattle', 'WA') 
    SA_latitude.append(geo_info[0])
    SA_longitude.append(geo_info[1])                              
    
df_seattle['Latitude']=SA_latitude
df_seattle['Longitude']=SA_longitude

df_seattle

| | Neighborhood | City | Latitude | Longitude |
|---|---|---|---|---|
| 0 | 23rd & Union/Jackson | Seattle | 47.612171 | -122.302627 |
| 1 | Admiral | Seattle | 47.581195 | -122.386546 |
| 2 | Aurora-Licton Springs | Seattle | 47.603832 | -122.330062 |
| 3 | Ballard | Seattle | 47.676507 | -122.386223 |
| 4 | Beacon Hill | Seattle | 47.579258 | -122.311598 |
| 5 | Belltown | Seattle | 47.613231 | -122.345361 |
| 6 | Bitter Lake/Broadview | Seattle | 47.726645 | -122.352272 |
| 7 | Capitol Hill | Seattle | 47.623831 | -122.318369 |
| 8 | Chinatown-International District | Seattle | 47.599175 | -122.323229 |
| 9 | Columbia City | Seattle | 47.557912 | -122.285216 |
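
The Seattle loop trims words such as 'Junction' and 'Commercial Core' from the names before geocoding. When trimming like this, keep in mind that `str.strip` treats its argument as a set of characters to remove from both ends, not as a suffix. The sketch below shows the difference; `drop_suffix` is a hypothetical helper and 'Metro Commercial Core' is a made-up name chosen to expose the pitfall.

```python
def drop_suffix(name, suffix):
    """Remove `suffix` from the end of `name` (if present) and trim whitespace."""
    if name.endswith(suffix):
        name = name[: -len(suffix)]
    return name.strip()


# strip() keeps eating characters as long as they appear in the argument,
# so it also removes the trailing "ro" of "Metro"
stripped = "Metro Commercial Core".strip("Commercial Core")
cleaned = drop_suffix("Metro Commercial Core", "Commercial Core")
junction = drop_suffix("West Seattle Junction", "Junction")
print(repr(stripped), repr(cleaned), repr(junction))
```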


In [5]:
#check if the latitude and longitude are unique for each neighborhood
df_seattle.nunique()

Neighborhood    40
City             1
Latitude        40
Longitude       40
dtype: int64

#### Run the same procedure for Arlington, VA

In [7]:
df_arlington= pd.read_csv('Arlington_Neighborhood.csv')
df_arlington

| | Neighborhood | City |
|---|---|---|
| 0 | Alcova Heights | Arlington |
| 1 | Arlington Forest | Arlington |
| 2 | Arlington Heights | Arlington |
| 3 | Arlington Ridge | Arlington |
| 4 | Arlington View / Johnson's Hill | Arlington |
| 5 | Aurora Hills | Arlington |
| 6 | Ballston | Arlington |
| 7 | Barcroft | Arlington |
| 8 | Bellevue Forest | Arlington |
| 9 | Bluemont | Arlington |


In [8]:
# obtain the latitude and longitude data and add them to the dataframe
AR_latitude =[]
AR_longitude=[]


for i in df_arlington['Neighborhood']:
    i=i.split('/')[0] 
    i=i.split('-')[0]
    geo_info = get_geospatial_data(i, 'Arlington', 'VA') 
    AR_latitude.append(geo_info[0])
    AR_longitude.append(geo_info[1])  
    
df_arlington['Latitude']=AR_latitude
df_arlington['Longitude']=AR_longitude

In [9]:
# Consolidate the neighborhoods that share the same latitude and longitude (joining their names with '/'),
# then drop the neighborhoods with no latitude data
df_arlington_new=df_arlington.groupby(['City', 'Latitude', 'Longitude'], as_index=False, sort =False)[['Neighborhood']].agg('/'.join)
df_arlington_new.drop(df_arlington_new[df_arlington_new['Longitude']=='NaN'].index, inplace =True, axis=0)
df_arlington_new.reset_index(drop=True, inplace=True)
df_arlington_new.nunique()


City             1
Latitude        52
Longitude       52
Neighborhood    52
dtype: int64

In [10]:
df_arlington_new

| | City | Latitude | Longitude | Neighborhood |
|---|---|---|---|---|
| 0 | Arlington | 38.8646 | -77.0972 | Alcova Heights |
| 1 | Arlington | 38.8689 | -77.1131 | Arlington Forest |
| 2 | Arlington | 38.8696 | -77.0922 | Arlington Heights |
| 3 | Arlington | 40.984 | -81.4939 | Arlington Ridge |
| 4 | Arlington | 38.8631 | -77.0726 | Arlington View / Johnson's Hill |
| 5 | Arlington | 38.8515 | -77.0641 | Aurora Hills |
| 6 | Arlington | 38.8831 | -77.1101 | Ballston |
| 7 | Arlington | 38.8559 | -77.1039 | Barcroft |
| 8 | Arlington | 38.9143 | -77.1136 | Bellevue Forest |
| 9 | Arlington | 38.8747 | -77.133 | Bluemont |
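
The consolidation used to build `df_arlington_new` can be tried on a small frame to see what it does. The rows below are made up for illustration: two names share one pair of coordinates and one row carries the `'NaN'` placeholder returned by the geocoding helper.

```python
import pandas as pd

# Made-up rows: Alpha and Beta geocode to the same point, Delta has no coordinates
toy = pd.DataFrame({
    "Neighborhood": ["Alpha", "Beta", "Gamma", "Delta"],
    "City": ["Arlington"] * 4,
    "Latitude": [38.86, 38.86, 38.88, "NaN"],
    "Longitude": [-77.09, -77.09, -77.11, "NaN"],
})

# Join the names of neighborhoods that share identical coordinates
merged = (
    toy.groupby(["City", "Latitude", "Longitude"], as_index=False, sort=False)[["Neighborhood"]]
    .agg("/".join)
)
# Drop the rows whose coordinates are the 'NaN' placeholder string
merged = merged[merged["Longitude"] != "NaN"].reset_index(drop=True)
print(merged)
```

Alpha and Beta collapse into a single 'Alpha/Beta' row, so each remaining row has a distinct location for the venue search.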


In [11]:
# move the Neighborhood column to the front to fix the column order
fixed_columns = [df_arlington_new.columns[-1]] + list(df_arlington_new.columns[:-1])

df_arlington_final = df_arlington_new[fixed_columns]

df_arlington_final

| | Neighborhood | City | Latitude | Longitude |
|---|---|---|---|---|
| 0 | Alcova Heights | Arlington | 38.8646 | -77.0972 |
| 1 | Arlington Forest | Arlington | 38.8689 | -77.1131 |
| 2 | Arlington Heights | Arlington | 38.8696 | -77.0922 |
| 3 | Arlington Ridge | Arlington | 40.984 | -81.4939 |
| 4 | Arlington View / Johnson's Hill | Arlington | 38.8631 | -77.0726 |
| 5 | Aurora Hills | Arlington | 38.8515 | -77.0641 |
| 6 | Ballston | Arlington | 38.8831 | -77.1101 |
| 7 | Barcroft | Arlington | 38.8559 | -77.1039 |
| 8 | Bellevue Forest | Arlington | 38.9143 | -77.1136 |
| 9 | Bluemont | Arlington | 38.8747 | -77.133 |
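
One row in the Arlington output stands out: the coordinates for Arlington Ridge (40.984, -81.4939) are several degrees away from every other row, which suggests the geocoder matched a different place. A distance check can flag such rows before they reach the venue search. The sketch below is illustrative only; the 25 km threshold and the choice of Alcova Heights as the reference point are assumptions, and `haversine_km` is a hypothetical helper.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


# Reference point: the Alcova Heights coordinates from the Arlington output
ref_lat, ref_lon = 38.8646, -77.0972
MAX_KM = 25  # assumed threshold for "inside the Arlington area"

points = {
    "Arlington Ridge": (40.984, -81.4939),
    "Ballston": (38.8831, -77.1101),
}
suspect = [name for name, (lat, lon) in points.items()
           if haversine_km(ref_lat, ref_lon, lat, lon) > MAX_KM]
print(suspect)
```

Rows flagged this way could be geocoded again with a more specific address or corrected by hand.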


### Long Island City is a neighborhood in the borough of Queens, so the New York geospatial JSON can be used to get the geo info for the Queens neighborhoods

In [12]:
with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)
newyork_neighborhood = newyork_data['features']

#### Transfer the data into a DataFrame and extract the neighborhood data for the borough of Queens

In [13]:
column_names =['Neighborhood', 'City', 'Latitude', 'Longitude'] 

# collect one row per neighborhood; DataFrame.append was removed in pandas 2.0,
# so build a list of rows and create the dataframe once at the end
rows = []

# get the neighborhood geo data for NY boroughs
for data in newyork_neighborhood:
    borough = data['properties']['borough'] 
    neighborhood_name = data['properties']['name']
        
    # the coordinates are stored as [longitude, latitude]
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]
    
    rows.append({'Neighborhood': neighborhood_name,
                 'City': borough,
                 'Latitude': neighborhood_lat,
                 'Longitude': neighborhood_lon})

NY_neighborhoods = pd.DataFrame(rows, columns=column_names)
NY_neighborhoods.head()


| | Neighborhood | City | Latitude | Longitude |
|---|---|---|---|---|
| 0 | Wakefield | Bronx | 40.894705 | -73.847201 |
| 1 | Co-op City | Bronx | 40.874294 | -73.829939 |
| 2 | Eastchester | Bronx | 40.887556 | -73.827806 |
| 3 | Fieldston | Bronx | 40.895437 | -73.905643 |
| 4 | Riverdale | Bronx | 40.890834 | -73.912585 |


In [14]:
#extract the information for the Queens borough
df_queens= NY_neighborhoods[NY_neighborhoods['City']=='Queens']
df_queens.head()

| | Neighborhood | City | Latitude | Longitude |
|---|---|---|---|---|
| 129 | Astoria | Queens | 40.768509 | -73.915654 |
| 130 | Woodside | Queens | 40.746349 | -73.901842 |
| 131 | Jackson Heights | Queens | 40.751981 | -73.882821 |
| 132 | Elmhurst | Queens | 40.744049 | -73.881656 |
| 133 | Howard Beach | Queens | 40.654225 | -73.838138 |


In [15]:
#check if the latitude and longitude are unique for each neighborhood
df_queens.nunique()

Neighborhood    81
City             1
Latitude        81
Longitude       81
dtype: int64

### Now merge the neighborhood data from the 3 cities to form the final dataset for data analysis

In [16]:
df_combine_neighborhood = pd.concat([df_seattle, df_arlington_final,df_queens], ignore_index=True, sort=False)

df_combine_neighborhood


| | Neighborhood | City | Latitude | Longitude |
|---|---|---|---|---|
| 0 | 23rd & Union/Jackson | Seattle | 47.6122 | -122.303 |
| 1 | Admiral | Seattle | 47.5812 | -122.387 |
| 2 | Aurora-Licton Springs | Seattle | 47.6038 | -122.33 |
| 3 | Ballard | Seattle | 47.6765 | -122.386 |
| 4 | Beacon Hill | Seattle | 47.5793 | -122.312 |
| 5 | Belltown | Seattle | 47.6132 | -122.345 |
| 6 | Bitter Lake/Broadview | Seattle | 47.7266 | -122.352 |
| 7 | Capitol Hill | Seattle | 47.6238 | -122.318 |
| 8 | Chinatown-International District | Seattle | 47.5992 | -122.323 |
| 9 | Columbia City | Seattle | 47.5579 | -122.285 |


In [17]:
df_combine_neighborhood.nunique()

Neighborhood    172
City              3
Latitude        173
Longitude       173
dtype: int64

In [18]:
# Save the final dataset as AmazonHQ_Geo_info.csv
# (pass the file path directly so that pandas applies the encoding argument)

df_combine_neighborhood.to_csv('AmazonHQ_Geo_info.csv', index=False, sep=';', encoding='utf-8')

In [19]:
df_combine_neighborhood.describe(include='all')

| | Neighborhood | City | Latitude | Longitude |
|---|---|---|---|---|
| count | 173 | 173 | 173.0 | 173.0 |
| unique | 172 | 3 | 173.0 | 173.0 |
| top | Forest Hills | Queens | 47.526468 | -73.758676 |
| freq | 2 | 81 | 1.0 | 1.0 |
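
The summary shows 172 unique names across 173 rows, with Forest Hills appearing twice. Because each city's own list has unique names, the two rows must come from different cities, and any later step that groups venues by neighborhood name alone would merge them. One way to keep them apart, sketched here with hypothetical helpers, is to build a key from the name and the city. Which cities the two Forest Hills rows belong to is an assumption in the toy data.

```python
from collections import Counter


def duplicated_names(rows):
    """Return the neighborhood names that occur more than once."""
    counts = Counter(name for name, _city in rows)
    return sorted(name for name, n in counts.items() if n > 1)


def unique_key(name, city):
    """Combine name and city so that same-named neighborhoods stay distinct."""
    return f"{name} ({city})"


# Toy rows; the city paired with each 'Forest Hills' is an assumption
rows = [
    ("Ballard", "Seattle"),
    ("Forest Hills", "Arlington"),
    ("Forest Hills", "Queens"),
]
dupes = duplicated_names(rows)
keys = [unique_key(name, city) for name, city in rows]
print(dupes, keys)
```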


### Create maps for the 3 HQ cities

In [20]:
# create map of Seattle using latitude and longitude values

seatttle_location= get_geospatial_data('Seattle', 'WA', 'USA')
map_seattle = folium.Map(location=[seatttle_location[0], seatttle_location[1]], zoom_start=11)

# add markers to map
for lat, lng, label in zip(df_seattle['Latitude'], df_seattle['Longitude'], df_seattle['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_seattle)  
    
map_seattle

In [21]:
# create map of Arlington using latitude and longitude values

arlington_location= get_geospatial_data('Arlington', 'VA', 'USA')
map_arlington = folium.Map(location=[arlington_location[0], arlington_location[1]], zoom_start=12)

# add markers to map
for lat, lng, label in zip(df_arlington_final['Latitude'], df_arlington_final['Longitude'], df_arlington_final['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_arlington)  
    
map_arlington

In [22]:
# create map of Queens, NY using latitude and longitude values

queens_location= get_geospatial_data('Queens', 'NY', 'USA')
map_queens = folium.Map(location=[queens_location[0], queens_location[1]], zoom_start=10)

# add markers to map
for lat, lng, label in zip(df_queens['Latitude'], df_queens['Longitude'], df_queens['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_queens)  
    
map_queens

## Create the top 100 venues dataset using the Foursquare API, based on the neighborhood geo data from the 3 Amazon HQ cities

#### Define the Foursquare credentials (the code block is hidden)

In [23]:
# @hidden_cell
#Define Foursquare Credentials and Version
CLIENT_ID = 'YOUR_CLIENT_ID' # placeholder: the real Foursquare ID is kept out of the published notebook
CLIENT_SECRET = 'YOUR_CLIENT_SECRET' # placeholder: the real Foursquare secret is kept out of the published notebook
VERSION = '20180605' # Foursquare API version
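
Since this cell is meant to stay hidden, one option is to keep the credentials out of the notebook altogether and read them from environment variables. The sketch below is one possible approach, not a Foursquare requirement: the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helper `load_credentials` are assumptions made for illustration.

```python
import os


def load_credentials(env=os.environ):
    """Read Foursquare credentials from environment variables.

    The variable names are a convention chosen for this sketch.
    """
    try:
        return env["FOURSQUARE_CLIENT_ID"], env["FOURSQUARE_CLIENT_SECRET"]
    except KeyError as missing:
        raise RuntimeError(f"Missing environment variable: {missing}") from None


# Demonstration with a plain dict standing in for os.environ
demo_env = {"FOURSQUARE_CLIENT_ID": "id-placeholder",
            "FOURSQUARE_CLIENT_SECRET": "secret-placeholder"}
client_id, client_secret = load_credentials(demo_env)
print(client_id)
```

In the notebook itself, `CLIENT_ID, CLIENT_SECRET = load_credentials()` would then replace the hard-coded strings.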

#### Create a function to explore the top 100 venues near a neighborhood, then run it for each neighborhood

In [24]:
#create a function to explore the top 100 venues near a neighborhood
def getNearbyVenues(names, latitudes, longitudes, radius=500, LIMIT=100):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
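
The function above indexes straight into the response, so a reply without `groups`, or a venue with an empty `categories` list, would raise an exception partway through the loop. A more tolerant way to flatten one reply is sketched below. The helper `extract_venues` is hypothetical, and both payloads are hand-written: the first mimics the keys that `getNearbyVenues` reads, while the shape of the error reply is an assumption.

```python
def extract_venues(payload, name, lat, lng):
    """Flatten one explore-style response into rows, tolerating missing keys.

    `payload` is the decoded JSON dict. A reply without groups yields an
    empty list, and a venue without a category gets None.
    """
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item.get("venue", {})
        location = venue.get("location", {})
        categories = venue.get("categories", [])
        category = categories[0]["name"] if categories else None
        rows.append((name, lat, lng, venue.get("name"),
                     location.get("lat"), location.get("lng"), category))
    return rows


# Hand-written payloads for illustration
ok = {"response": {"groups": [{"items": [
    {"venue": {"name": "Cafe", "location": {"lat": 1.0, "lng": 2.0},
               "categories": [{"name": "Coffee Shop"}]}},
    {"venue": {"name": "Mystery", "location": {"lat": 1.1, "lng": 2.1},
               "categories": []}},
]}]}}
error = {"meta": {"code": 429}, "response": {}}

rows_ok = extract_venues(ok, "Ballard", 47.68, -122.39)
rows_err = extract_venues(error, "Ballard", 47.68, -122.39)
print(len(rows_ok), rows_err)
```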

In [25]:
# run the above function on each neighborhood and create a new dataframe called AmazonHQ_venues.

AmazonHQ_venues = getNearbyVenues(names=df_combine_neighborhood['Neighborhood'],
                                   latitudes=df_combine_neighborhood['Latitude'],
                                   longitudes=df_combine_neighborhood['Longitude'],
                                 radius=500 )

23rd & Union/Jackson
Admiral
Aurora-Licton Springs
Ballard
Beacon Hill
Belltown
Bitter Lake/Broadview
Capitol Hill
Chinatown-International District
Columbia City
Crown Hill
Delridge
Downtown Commercial Core
Eastlake
First Hill
Fremont
Georgetown
Green Lake
Greenwood-Phinney Ridge
Judkins Park
Lake City
Madison-Miller
Magnolia
Montlake
Morgan Junction
North Rainier / Mount Baker
Northgate
Othello
Pioneer Square
Queen Anne
Rainier Beach
Roosevelt
Sand Point
South Lake Union
South Park
University District
Uptown
Wallingford
West Seattle Junction
Westwood Village / Roxhill-Highland Park
Alcova Heights
Arlington Forest
Arlington Heights
Arlington Ridge
Arlington View / Johnson's Hill
Aurora Hills
Ballston
Barcroft
Bellevue Forest
Bluemont
Bonair
Brandon Village
Buckingham
Carlin Springs
Cherrydale
Claremont
Clarendon
Columbia Forest
Columbia Heights
Country Club Hills
Crescent Hills
Crystal City
Crystal Gateway
Dominion Hills
Donaldson Run
Douglas Park
East Falls Church
Fairlington
Forest H

In [26]:
AmazonHQ_venues.shape

(4679, 7)

In [27]:
AmazonHQ_venues.head()

| | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 0 | 23rd & Union/Jackson | 47.612171 | -122.302627 | Uncle Ike's CD | 47.613135 | -122.301686 | Alternative Healer |
| 1 | 23rd & Union/Jackson | 47.612171 | -122.302627 | Central Cinema | 47.613319 | -122.30528 | Indie Movie Theater |
| 2 | 23rd & Union/Jackson | 47.612171 | -122.302627 | Raised Doughnuts | 47.611767 | -122.302825 | Donut Shop |
| 3 | 23rd & Union/Jackson | 47.612171 | -122.302627 | Chuck's Hop Shop - Central District | 47.612856 | -122.305924 | Beer Store |
| 4 | 23rd & Union/Jackson | 47.612171 | -122.302627 | Squirrel Chops | 47.61288 | -122.303361 | Coffee Shop |


In [28]:
# Save the final dataset as AmazonHQ_Venues_info.csv
# (pass the file path directly so that pandas applies the encoding argument)

AmazonHQ_venues.to_csv('AmazonHQ_Venues_info.csv', index=False, sep=';', encoding='utf-8')
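
The table of contents lists preparing the venue data for clustering as the next step. A common way to do this, shown here as a sketch under assumptions rather than as the method used later, is to one-hot encode the venue categories and average them per neighborhood, so that each neighborhood becomes a vector of category shares. The helper `venue_profile` is hypothetical and the toy rows only borrow the column layout of `AmazonHQ_venues`.

```python
import pandas as pd


def venue_profile(venues, key="Neighborhood", category="Venue Category"):
    """Return one row per neighborhood with the share of each venue category."""
    # dtype=float keeps the indicator columns numeric across pandas versions
    onehot = pd.get_dummies(venues[category], dtype=float)
    # group by the key Series directly, which avoids a clash if a venue
    # category happens to carry the same name as the key column
    return onehot.groupby(venues[key]).mean()


# Toy venues in the same column layout as AmazonHQ_venues
toy = pd.DataFrame({
    "Neighborhood": ["Ballard", "Ballard", "Ballston", "Ballston"],
    "Venue Category": ["Coffee Shop", "Donut Shop", "Coffee Shop", "Coffee Shop"],
})
profile = venue_profile(toy)
print(profile)
```

Each row of the result sums to 1, so neighborhoods with different numbers of venues remain comparable when they are passed to k-means.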