### This notebook is the capstone project for the IBM Data Science Professional Certificate. It contains the report for "The Battle of Neighborhoods" by Alexa Lodise.

# Introduction/Business Problem

### Breweries have become a popular place for city residents to gather after work and on weekends, so in the right city, opening one can be a lucrative business. The difficulty is deciding which city, and where in that city, to open it. Using data analysis and clustering, I will determine whether a new brewery should open in Toronto or New York City, and where in the chosen city it should go.

# Data

### To solve this problem, I need coordinates for the neighborhoods of both New York City and Toronto. The New York City data comes from NYU and the Toronto data comes from Wikipedia. Once the data is converted into clean dataframes, Foursquare is used to find the breweries in each city. I will choose the city based on which one has the fewest breweries. Once the city is picked, I will use clustering to determine which neighborhood has the fewest breweries and recommend that area for a new business.

### Gathering coordinates for NYC:

In [3]:
# importing all required packages for the functions that will be used 
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (top-level import, pandas >= 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


In [4]:
# Downloading NYC Data
!wget -q -O 'newyork_data.json' https://cocl.us/new_york_dataset


In [5]:
# Load data into variable
with open('newyork_data.json') as json_data:
    nyc = json.load(json_data)

In [6]:
# gather relevant data from nyc data
neighborhoods_data = nyc['features']

In [7]:
# put data into pandas dataframe
# define the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude']

# collect one row per neighborhood, then build the dataframe once
# (DataFrame.append was removed in pandas 2.0)
rows = []
for data in neighborhoods_data:
    borough = data['properties']['borough']
    neighborhood_name = data['properties']['name']

    # GeoJSON stores coordinates as [longitude, latitude]
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]

    rows.append({'Borough': borough,
                 'Neighborhood': neighborhood_name,
                 'Latitude': neighborhood_lat,
                 'Longitude': neighborhood_lon})

nycdata = pd.DataFrame(rows, columns=column_names)

In [8]:
# Look at dataframe to make sure everything is correct
nycdata.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Bronx,Wakefield,40.894705,-73.847201
1,Bronx,Co-op City,40.874294,-73.829939
2,Bronx,Eastchester,40.887556,-73.827806
3,Bronx,Fieldston,40.895437,-73.905643
4,Bronx,Riverdale,40.890834,-73.912585


In [9]:
# Gather coordinates for NYC itself
address = 'New York City, NY'

geolocator = Nominatim(user_agent="ny_explorer")
nyclocation = geolocator.geocode(address)
nyclatitude = nyclocation.latitude
nyclongitude = nyclocation.longitude

### Gather coordinate data for Toronto:

In [10]:
# Get data off of wikipedia
import urllib.request
import pandas as pd

# get wikipedia html data
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
req = urllib.request.urlopen(url)
article = req.read().decode()

with open('ISO_3166-1_alpha-2.html', 'w') as fo:
    fo.write(article)

In [11]:
# Use Beautiful Soup to scrape data from the Wikipedia article into a dataframe
from bs4 import BeautifulSoup

# Load article, turn into soup and get the <table>s.
article = open('ISO_3166-1_alpha-2.html').read()
soup = BeautifulSoup(article, 'html.parser')
tables = soup.find_all('table', class_='sortable')
lst=[]
cols=['Postal Code', 'Borough', 'Neighbourhood']
# Search through the tables for the one with the headings we want.
for table in tables:
    ths = table.find_all('th')
    headings = [th.text.strip() for th in ths]
    if headings[:3] == ['Postcode', 'Borough', 'Neighbourhood']:
        break

        
# Extract the columns we want into a list of rows.
for tr in table.find_all('tr'):
    tds = tr.find_all('td')
    if not tds:
        continue
    postcode, borough, neighbourhood = [td.text.strip() for td in tds[:3]]
    lst.append([postcode, borough, neighbourhood])
#create dataframe with found values        
df3=pd.DataFrame(lst, columns=cols)
        
df3.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


In [12]:
# Clean up data frame to not have null data or repeating postal codes
# filter out not assigned data; .copy() gives an independent frame so the
# assignment below does not trigger SettingWithCopyWarning
dffilter = df3[df3.Borough != 'Not assigned'].copy()

# replace not assigned neighbourhood data with borough data
dffilter['Neighbourhood'] = np.where(dffilter['Neighbourhood'] == 'Not assigned', dffilter['Borough'], dffilter['Neighbourhood'])

# group repeat postal code values and combine the neighbourhoods
dfgroup = dffilter.groupby('Postal Code').agg({'Borough':'first', 
                             'Neighbourhood': ', '.join}).reset_index()

dfgroup.head()


Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


In [13]:
# load the postal code coordinate data from a CSV file and merge it into the current Toronto dataframe
xlfile=pd.read_csv('Geospatial_Coordinates.csv')

torontodata=pd.merge(dfgroup, xlfile, on='Postal Code')
torontodata.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
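
An inner merge silently drops any postal code that has no match in the coordinates file. A minimal sketch of how that could be checked is shown here, assuming the same `Postal Code` key. The two small frames (including the made-up `M9Z` row) and the helper name `unmatched_postal_codes` are illustrative only and are not part of the original notebook.

```python
import pandas as pd

def unmatched_postal_codes(left, right, key='Postal Code'):
    """Return the key values in `left` that have no partner in `right`."""
    # how='left' keeps every row of `left`; indicator=True adds a `_merge`
    # column saying whether each row was found in both frames or only in left
    checked = pd.merge(left, right[[key]].drop_duplicates(), on=key,
                       how='left', indicator=True)
    return checked.loc[checked['_merge'] == 'left_only', key].tolist()

# hypothetical toy data, only to show the behaviour
neighbourhoods = pd.DataFrame({'Postal Code': ['M1B', 'M1C', 'M9Z'],
                               'Borough': ['Scarborough', 'Scarborough', 'Etobicoke']})
coordinates = pd.DataFrame({'Postal Code': ['M1B', 'M1C'],
                            'Latitude': [43.806686, 43.784535],
                            'Longitude': [-79.194353, -79.160497]})

missing = unmatched_postal_codes(neighbourhoods, coordinates)
print(missing)
```

Running the same helper with `dfgroup` and `xlfile` would list any postal codes that the merge above leaves out.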


In [14]:
# Gather coordinates for Toronto itself
address = 'Toronto, CA'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude


### Use Foursquare to gather brewery venue data for NYC:

In [15]:
# entering client data to connect to foursquare
# the credentials are read from environment variables instead of being
# hard-coded and printed; the variable names below are this notebook's own choice
import os
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # your Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # your Foursquare Secret
VERSION = '20180604'
LIMIT = 30
print('Foursquare credentials loaded.')

Foursquare credentials loaded.


In [16]:
# gather nearby venue data for all neighborhoods in NYC and put into dataframe

def getNearbyVenues(names, nyclatitudes, nyclongitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, nyclatitudes, nyclongitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
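
The list comprehension above assumes that every venue has at least one category and that the response always contains `groups`. A hedged sketch of a more defensive parser follows. It works on an already-decoded response dictionary, so it makes no network call. The function name `parse_explore_items` and the sample payload are illustrative assumptions, and the payload only mimics the fields that the function above reads.

```python
def parse_explore_items(payload, name, lat, lng):
    """Flatten one decoded explore response into venue rows.

    Returns a list of tuples in the same column order used by
    getNearbyVenues; venues without a category get None.
    """
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []

    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        category = categories[0]['name'] if categories else None
        rows.append((name, lat, lng,
                     venue['name'],
                     venue['location']['lat'],
                     venue['location']['lng'],
                     category))
    return rows

# hypothetical payload shaped like the fields read above
sample = {'response': {'groups': [{'items': [
    {'venue': {'name': 'The Bronx Brewery',
               'location': {'lat': 40.801774, 'lng': -73.910297},
               'categories': [{'name': 'Brewery'}]}},
    {'venue': {'name': 'Uncategorized Spot',
               'location': {'lat': 40.80, 'lng': -73.91},
               'categories': []}},
]}]}}

rows = parse_explore_items(sample, 'Port Morris', 40.801664, -73.913221)
empty = parse_explore_items({'response': {}}, 'Port Morris', 40.801664, -73.913221)
```

With a parser like this, a neighborhood that returns no venues contributes an empty list instead of raising an exception partway through the loop.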

In [17]:
# run above function and put data into a dataframe called nyc (this reuses the name of the raw JSON loaded earlier)

nyc = getNearbyVenues(names=nycdata['Neighborhood'],
                                   nyclatitudes=nycdata['Latitude'],
                                   nyclongitudes=nycdata['Longitude']
                                  )


Wakefield
Co-op City
Eastchester
Fieldston
Riverdale
Kingsbridge
Marble Hill
Woodlawn
Norwood
Williamsbridge
Baychester
Pelham Parkway
City Island
Bedford Park
University Heights
Morris Heights
Fordham
East Tremont
West Farms
High  Bridge
Melrose
Mott Haven
Port Morris
Longwood
Hunts Point
Morrisania
Soundview
Clason Point
Throgs Neck
Country Club
Parkchester
Westchester Square
Van Nest
Morris Park
Belmont
Spuyten Duyvil
North Riverdale
Pelham Bay
Schuylerville
Edgewater Park
Castle Hill
Olinville
Pelham Gardens
Concourse
Unionport
Edenwald
Bay Ridge
Bensonhurst
Sunset Park
Greenpoint
Gravesend
Brighton Beach
Sheepshead Bay
Manhattan Terrace
Flatbush
Crown Heights
East Flatbush
Kensington
Windsor Terrace
Prospect Heights
Brownsville
Williamsburg
Bushwick
Bedford Stuyvesant
Brooklyn Heights
Cobble Hill
Carroll Gardens
Red Hook
Gowanus
Fort Greene
Park Slope
Cypress Hills
East New York
Starrett City
Canarsie
Flatlands
Mill Island
Manhattan Beach
Coney Island
Bath Beach
Borough Park
Dyker

In [19]:
# Look at dataframe to verify data
nyc.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Wakefield,40.894705,-73.847201,Lollipops Gelato,40.894123,-73.845892,Dessert Shop
1,Wakefield,40.894705,-73.847201,Rite Aid,40.896649,-73.844846,Pharmacy
2,Wakefield,40.894705,-73.847201,Cooler Runnings Jamaican Restaurant Inc,40.898276,-73.850381,Caribbean Restaurant
3,Wakefield,40.894705,-73.847201,Carvel Ice Cream,40.890487,-73.848568,Ice Cream Shop
4,Wakefield,40.894705,-73.847201,SUBWAY,40.890656,-73.849192,Sandwich Place


In [20]:
# Filter out dataframe for just brewery data
nycbrews=nyc.loc[nyc['Venue Category'] == "Brewery"]
nycbrews

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
450,Port Morris,40.801664,-73.913221,The Bronx Brewery,40.801774,-73.910297,Brewery
1229,Prospect Heights,40.676822,-73.964859,Bitter & Esters,40.677092,-73.96369,Brewery
1469,Red Hook,40.676253,-74.012759,Sixpoint Brewery,40.673835,-74.011866,Brewery
1485,Gowanus,40.673931,-73.994441,Other Half Brewing Co.,40.67379,-73.999134,Brewery
1658,Coney Island,40.574293,-73.988683,Coney Island Brewing Co.,40.575189,-73.983991,Brewery
1812,Downtown,40.690844,-73.983463,Circa Brewing Co,40.691685,-73.986218,Brewery
1852,Prospect Lefferts Gardens,40.65842,-73.954899,Island to Island Brewery,40.655835,-73.953091,Brewery
3428,Glendale,40.702762,-73.870742,Finback Brewery,40.706567,-73.873179,Brewery
3963,Steinway,40.775923,-73.90229,SingleCut Beersmiths,40.778387,-73.901902,Brewery
4669,Tompkinsville,40.637316,-74.080554,Flagship Brewing Co.,40.636994,-74.075694,Brewery


### Use Foursquare data to gather brewery data for Toronto:

In [21]:
# gather data for venues in toronto

toronto_venues = getNearbyVenues(names=torontodata['Neighbourhood'],
                                   nyclatitudes=torontodata['Latitude'],
                                   nyclongitudes=torontodata['Longitude']
                                  )

Rouge, Malvern
Highland Creek, Rouge Hill, Port Union
Guildwood, Morningside, West Hill
Woburn
Cedarbrae
Scarborough Village
East Birchmount Park, Ionview, Kennedy Park
Clairlea, Golden Mile, Oakridge
Cliffcrest, Cliffside, Scarborough Village West
Birch Cliff, Cliffside West
Dorset Park, Scarborough Town Centre, Wexford Heights
Maryvale, Wexford
Agincourt
Clarks Corners, Sullivan, Tam O'Shanter
Agincourt North, L'Amoreaux East, Milliken, Steeles East
L'Amoreaux West
Upper Rouge
Hillcrest Village
Fairview, Henry Farm, Oriole
Bayview Village
Silver Hills, York Mills
Newtonbrook, Willowdale
Willowdale South
York Mills West
Willowdale West
Parkwoods
Don Mills North
Flemingdon Park, Don Mills South
Bathurst Manor, Downsview North, Wilson Heights
Northwood Park, York University
CFB Toronto, Downsview East
Downsview West
Downsview Central
Downsview Northwest
Victoria Village
Woodbine Gardens, Parkview Hill
Woodbine Heights
The Beaches
Leaside
Thorncliffe Park
East Toronto
The Danforth West, 

In [22]:
# verify data looks correct
toronto_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Rouge, Malvern",43.806686,-79.194353,Wendy's,43.807448,-79.199056,Fast Food Restaurant
1,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497,Royal Canadian Legion,43.782533,-79.163085,Bar
2,"Guildwood, Morningside, West Hill",43.763573,-79.188711,Swiss Chalet Rotisserie & Grill,43.767697,-79.189914,Pizza Place
3,"Guildwood, Morningside, West Hill",43.763573,-79.188711,G & G Electronics,43.765309,-79.191537,Electronics Store
4,"Guildwood, Morningside, West Hill",43.763573,-79.188711,Marina Spa,43.766,-79.191,Spa


In [23]:
# Filter out dataframe for just brewery data
torontobrews=toronto_venues.loc[toronto_venues['Venue Category'] == "Brewery"]
torontobrews

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
284,Leaside,43.70906,-79.363452,Amsterdam Barrel House,43.706021,-79.361329,Brewery
315,"The Danforth West, Riverdale",43.679557,-79.352188,Louis Cifer Brew Works,43.677663,-79.351313,Brewery
343,"The Beaches West, India Bazaar",43.668999,-79.315572,Godspeed Brewery,43.67262,-79.319228,Brewery
433,Davisville,43.704324,-79.38879,Granite Brewery,43.707991,-79.389943,Brewery
1041,"Dovercourt Village, Dufferin",43.669005,-79.442259,Blood Brothers Brewing,43.669944,-79.436533,Brewery
1058,"Little Portugal, Trinity",43.647927,-79.41975,Bellwoods Brewery,43.647097,-79.419955,Brewery
1120,"The Junction North, Runnymede",43.673185,-79.487262,High Park Brewery,43.669903,-79.48343,Brewery
1233,Business Reply Mail Processing Centre 969 Eastern,43.662744,-79.321558,Rorschach Brewing Co.,43.663483,-79.319824,Brewery


## Methodology

#### The analysis of where to place a brewery begins by counting the breweries in each city and visualizing how close together they are. Once that is complete, I can decide which city to run a cluster analysis on. The cluster analysis then determines which neighborhood to place the brewery in.

## Analysis

### Figuring out the number of breweries in each city:

In [24]:
# Do some initial descriptive stats on breweries 
nyc_count=len(nycbrews.index)
toronto_count=len(torontobrews.index)
print('Number of breweries in NYC:', nyc_count)


Number of breweries in NYC: 10


In [25]:
print('Number of breweries in Toronto:', toronto_count)

Number of breweries in Toronto: 8


### Visualizing the breweries in each city:

In [32]:
import folium 
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, neighborhood in zip(torontobrews['Neighborhood Latitude'], torontobrews['Neighborhood Longitude'], torontobrews['Neighborhood']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto


In [54]:
import folium 
# create map of nyc using latitude and longitude values
map_nyc = folium.Map(location=[nyclatitude, nyclongitude], zoom_start=10)

# add markers to map
for lat, lng, neighborhood in zip(nycbrews['Neighborhood Latitude'],nycbrews['Neighborhood Longitude'], nycbrews['Neighborhood']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_nyc)  
    
map_nyc

#### Based on this analysis, NYC has more breweries than Toronto (10 vs 8), but the breweries are more spread out in NYC. For that reason, clustering will be used to figure out where in NYC someone should open their brewery.
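
The comparison of spread is read off the two maps. One way it could be quantified is the average distance from each brewery to its nearest neighbour, sketched below with the haversine formula and the venue coordinates from the two brewery tables above. The helper names and the Earth radius of 6371 km are assumptions of this sketch, not part of the original analysis.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

def mean_nearest_neighbour_km(points):
    """Average, over all points, of the distance to the closest other point."""
    nearest = []
    for i, (lat1, lon1) in enumerate(points):
        others = [haversine_km(lat1, lon1, lat2, lon2)
                  for j, (lat2, lon2) in enumerate(points) if j != i]
        nearest.append(min(others))
    return sum(nearest) / len(nearest)

# venue latitude/longitude pairs copied from the nycbrews table
nyc_points = [(40.801774, -73.910297), (40.677092, -73.96369),
              (40.673835, -74.011866), (40.67379, -73.999134),
              (40.575189, -73.983991), (40.691685, -73.986218),
              (40.655835, -73.953091), (40.706567, -73.873179),
              (40.778387, -73.901902), (40.636994, -74.075694)]

# venue latitude/longitude pairs copied from the torontobrews table
toronto_points = [(43.706021, -79.361329), (43.677663, -79.351313),
                  (43.67262, -79.319228), (43.707991, -79.389943),
                  (43.669944, -79.436533), (43.647097, -79.419955),
                  (43.669903, -79.48343), (43.663483, -79.319824)]

nyc_spread = mean_nearest_neighbour_km(nyc_points)
toronto_spread = mean_nearest_neighbour_km(toronto_points)
print('NYC mean nearest-neighbour distance (km):', round(nyc_spread, 2))
print('Toronto mean nearest-neighbour distance (km):', round(toronto_spread, 2))
```

A larger value means the breweries sit farther from their closest competitor, so the two printed numbers give a direct check on the visual impression from the maps.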

### NYC brewery clustering:

In [73]:
# one hot encoding
manhattan_onehot = pd.get_dummies(nycbrews[['Venue']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
manhattan_onehot['Neighborhood'] = nycbrews['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [manhattan_onehot.columns[-1]] + list(manhattan_onehot.columns[:-1])
manhattan_onehot = manhattan_onehot[fixed_columns]

manhattan_onehot.head()

Unnamed: 0,Neighborhood,Bitter & Esters,Circa Brewing Co,Coney Island Brewing Co.,Finback Brewery,Flagship Brewing Co.,Island to Island Brewery,Other Half Brewing Co.,SingleCut Beersmiths,Sixpoint Brewery,The Bronx Brewery
450,Port Morris,0,0,0,0,0,0,0,0,0,1
1229,Prospect Heights,1,0,0,0,0,0,0,0,0,0
1469,Red Hook,0,0,0,0,0,0,0,0,1,0
1485,Gowanus,0,0,0,0,0,0,1,0,0,0
1658,Coney Island,0,0,1,0,0,0,0,0,0,0


In [74]:
manhattan_grouped = manhattan_onehot.groupby('Neighborhood').mean().reset_index()
manhattan_grouped

Unnamed: 0,Neighborhood,Bitter & Esters,Circa Brewing Co,Coney Island Brewing Co.,Finback Brewery,Flagship Brewing Co.,Island to Island Brewery,Other Half Brewing Co.,SingleCut Beersmiths,Sixpoint Brewery,The Bronx Brewery
0,Coney Island,0,0,1,0,0,0,0,0,0,0
1,Downtown,0,1,0,0,0,0,0,0,0,0
2,Glendale,0,0,0,1,0,0,0,0,0,0
3,Gowanus,0,0,0,0,0,0,1,0,0,0
4,Port Morris,0,0,0,0,0,0,0,0,0,1
5,Prospect Heights,1,0,0,0,0,0,0,0,0,0
6,Prospect Lefferts Gardens,0,0,0,0,0,1,0,0,0,0
7,Red Hook,0,0,0,0,0,0,0,0,1,0
8,Steinway,0,0,0,0,0,0,0,1,0,0
9,Tompkinsville,0,0,0,0,1,0,0,0,0,0


In [85]:
num_top_venues = 1

for hood in manhattan_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = manhattan_grouped[manhattan_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Coney Island----
                      venue  freq
0  Coney Island Brewing Co.   1.0


----Downtown----
              venue  freq
0  Circa Brewing Co   1.0


----Glendale----
             venue  freq
0  Finback Brewery   1.0


----Gowanus----
                    venue  freq
0  Other Half Brewing Co.   1.0


----Port Morris----
               venue  freq
0  The Bronx Brewery   1.0


----Prospect Heights----
             venue  freq
0  Bitter & Esters   1.0


----Prospect Lefferts Gardens----
                      venue  freq
0  Island to Island Brewery   1.0


----Red Hook----
              venue  freq
0  Sixpoint Brewery   1.0


----Steinway----
                  venue  freq
0  SingleCut Beersmiths   1.0


----Tompkinsville----
                  venue  freq
0  Flagship Brewing Co.   1.0




In [86]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [87]:
num_top_venues = 1

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = manhattan_grouped['Neighborhood']

for ind in np.arange(manhattan_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(manhattan_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue
0,Coney Island,Coney Island Brewing Co.
1,Downtown,Circa Brewing Co
2,Glendale,Finback Brewery
3,Gowanus,Other Half Brewing Co.
4,Port Morris,The Bronx Brewery


In [96]:
# set number of clusters
kclusters = 7

manhattan_grouped_clustering = manhattan_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(manhattan_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([8, 7, 9, 2, 3, 0, 6, 5, 1, 4], dtype=int32)

In [97]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels10', kmeans.labels_)

manhattan_merged = nycbrews

# join the cluster labels onto the NYC brewery data, which already holds latitude/longitude for each neighborhood
manhattan_merged = manhattan_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

manhattan_merged.head() # check the last columns!

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category,Cluster Labels10,Cluster Labels9,Cluster Labels8,1st Most Common Venue
450,Port Morris,40.801664,-73.913221,The Bronx Brewery,40.801774,-73.910297,Brewery,3,2,0,The Bronx Brewery
1229,Prospect Heights,40.676822,-73.964859,Bitter & Esters,40.677092,-73.96369,Brewery,0,0,4,Bitter & Esters
1469,Red Hook,40.676253,-74.012759,Sixpoint Brewery,40.673835,-74.011866,Brewery,5,4,2,Sixpoint Brewery
1485,Gowanus,40.673931,-73.994441,Other Half Brewing Co.,40.67379,-73.999134,Brewery,2,0,0,Other Half Brewing Co.
1658,Coney Island,40.574293,-73.988683,Coney Island Brewing Co.,40.575189,-73.983991,Brewery,8,3,1,Coney Island Brewing Co.


In [99]:
# create map
map_clusters = folium.Map(location=[nyclatitude, nyclongitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(manhattan_merged['Neighborhood Latitude'], manhattan_merged['Neighborhood Longitude'], manhattan_merged['Neighborhood'], manhattan_merged['Cluster Labels9']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

#### Based on this analysis, the red cluster is not where I would recommend putting a brewery; however, the blue cluster looks like the perfect spot because it is away from other breweries.

In [100]:
manhattan_merged.loc[manhattan_merged['Cluster Labels9'] == 0, manhattan_merged.columns[[1] + list(range(5, manhattan_merged.shape[1]))]]

Unnamed: 0,Neighborhood Latitude,Venue Longitude,Venue Category,Cluster Labels10,Cluster Labels9,Cluster Labels8,1st Most Common Venue
1229,40.676822,-73.96369,Brewery,0,0,4,Bitter & Esters
1485,40.673931,-73.999134,Brewery,2,0,0,Other Half Brewing Co.
1812,40.690844,-73.986218,Brewery,7,0,0,Circa Brewing Co
1852,40.65842,-73.953091,Brewery,6,0,0,Island to Island Brewery


In [101]:
manhattan_merged.loc[manhattan_merged['Cluster Labels9'] == 1, manhattan_merged.columns[[1] + list(range(5, manhattan_merged.shape[1]))]]

Unnamed: 0,Neighborhood Latitude,Venue Longitude,Venue Category,Cluster Labels10,Cluster Labels9,Cluster Labels8,1st Most Common Venue
3963,40.775923,-73.901902,Brewery,1,1,3,SingleCut Beersmiths


In [102]:
manhattan_merged.loc[manhattan_merged['Cluster Labels9'] == 2, manhattan_merged.columns[[1] + list(range(5, manhattan_merged.shape[1]))]]

Unnamed: 0,Neighborhood Latitude,Venue Longitude,Venue Category,Cluster Labels10,Cluster Labels9,Cluster Labels8,1st Most Common Venue
450,40.801664,-73.910297,Brewery,3,2,0,The Bronx Brewery


In [103]:
manhattan_merged.loc[manhattan_merged['Cluster Labels9'] == 3, manhattan_merged.columns[[1] + list(range(5, manhattan_merged.shape[1]))]]

Unnamed: 0,Neighborhood Latitude,Venue Longitude,Venue Category,Cluster Labels10,Cluster Labels9,Cluster Labels8,1st Most Common Venue
1658,40.574293,-73.983991,Brewery,8,3,1,Coney Island Brewing Co.


In [104]:
manhattan_merged.loc[manhattan_merged['Cluster Labels9'] == 4, manhattan_merged.columns[[1] + list(range(5, manhattan_merged.shape[1]))]]

Unnamed: 0,Neighborhood Latitude,Venue Longitude,Venue Category,Cluster Labels10,Cluster Labels9,Cluster Labels8,1st Most Common Venue
1469,40.676253,-74.011866,Brewery,5,4,2,Sixpoint Brewery


In [105]:
manhattan_merged.loc[manhattan_merged['Cluster Labels9'] == 5, manhattan_merged.columns[[1] + list(range(5, manhattan_merged.shape[1]))]]

Unnamed: 0,Neighborhood Latitude,Venue Longitude,Venue Category,Cluster Labels10,Cluster Labels9,Cluster Labels8,1st Most Common Venue
4669,40.637316,-74.075694,Brewery,4,5,0,Flagship Brewing Co.


In [106]:
manhattan_merged.loc[manhattan_merged['Cluster Labels9'] == 6, manhattan_merged.columns[[1] + list(range(5, manhattan_merged.shape[1]))]]

Unnamed: 0,Neighborhood Latitude,Venue Longitude,Venue Category,Cluster Labels10,Cluster Labels9,Cluster Labels8,1st Most Common Venue
3428,40.702762,-73.873179,Brewery,9,6,0,Finback Brewery


#### This shows that the first cluster (label 0) is the most populated, with 4 breweries; every other cluster has 1.
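
The cluster sizes can also be tallied directly instead of being read off the seven tables. The sketch below uses only the standard library and the `Cluster Labels9` values shown in the tables above, keyed by venue name. In the notebook itself the same tally would presumably come from calling `value_counts()` on that column, which is an assumption, since that call does not appear in the original.

```python
from collections import Counter

# Cluster Labels9 value for each brewery, copied from the tables above
cluster_by_venue = {
    'Bitter & Esters': 0,
    'Other Half Brewing Co.': 0,
    'Circa Brewing Co': 0,
    'Island to Island Brewery': 0,
    'SingleCut Beersmiths': 1,
    'The Bronx Brewery': 2,
    'Coney Island Brewing Co.': 3,
    'Sixpoint Brewery': 4,
    'Flagship Brewing Co.': 5,
    'Finback Brewery': 6,
}

# count how many breweries fall into each cluster label
cluster_sizes = Counter(cluster_by_venue.values())
largest_cluster, largest_size = cluster_sizes.most_common(1)[0]
print('Cluster sizes:', dict(cluster_sizes))
print('Largest cluster:', largest_cluster, 'with', largest_size, 'breweries')
```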

## Results and Discussion

#### The analysis shows that placing a brewery in NYC is the best choice: although NYC had more breweries in total, they were not as close together as Toronto's. The cluster analysis showed that putting a brewery anywhere but the first cluster (label 0) would work, because that cluster had 4 breweries while each of the others had 1.

## Conclusion

#### The main purpose of this project was to recommend a city for a new brewery as well as where in that city to put it. The analysis concluded that opening a brewery in New York, in any neighborhood other than Prospect Heights, Gowanus, Downtown, or Prospect Lefferts Gardens, will provide a great business opportunity.