# Capstone Project--The Battle of Neighborhoods

## Table of Contents
* [Introduction (and Business Problem)](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)

## Introduction (and Business Problem) <a name="introduction"></a>

**Pittsburgh, PA** is the city I currently live in. I've been here for almost two years, and it is generally a good place to live. The city has a wide variety of restaurants, but how are they distributed across different areas? If I want to find a specific type of restaurant, where should I go? If I want to open a new restaurant, which type should I choose for each location? This project answers these questions using data science techniques, primarily clustering. Get ready, and start the gourmet adventure of the Three Rivers!

## Data <a name="data"></a>

In this project, we need several kinds of geospatial information about Pittsburgh.

First, we need the **zip codes of Pittsburgh neighborhoods and their coordinates**. These are available from several online sources. For this project, I use the *US Zip Code Latitude and Longitude* dataset [https://public.opendatasoft.com/explore/dataset/us-zip-code-latitude-and-longitude/table/], take the Pennsylvania rows, and clean them to get the data we need.

Next, we need **restaurant information for each neighborhood**, which we get from the Foursquare API. With this data, we can cluster the neighborhoods and find patterns in the kinds of restaurants in each one.

## Methodology <a name="methodology"></a>

In this project, we take the zip code of each Pittsburgh neighborhood and its coordinates, query the Foursquare API for nearby venues, and group the neighborhoods with K-means clustering based on which restaurant types are most common in each one. Once every neighborhood has a cluster label, we can analyze each cluster and decide which type of restaurant is a good fit to open in its neighborhoods.
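
To make the clustering idea concrete before working with real data, here is a minimal sketch on a made-up frequency table. The neighborhood names `A` to `D` and all numbers are hypothetical; only `KMeans` and its parameters come from scikit-learn.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical frequency table: share of each restaurant type among a
# neighborhood's venues (values are made up for illustration).
toy_freq = pd.DataFrame(
    {
        "Neighborhood": ["A", "B", "C", "D"],
        "Italian Restaurant": [0.30, 0.28, 0.00, 0.02],
        "Fast Food Restaurant": [0.00, 0.02, 0.25, 0.27],
    }
)

# K-means only sees the numeric frequency columns.
features = toy_freq.drop(columns="Neighborhood")
toy_kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features)

# Attach the cluster label to each neighborhood.
toy_freq["Cluster"] = toy_kmeans.labels_
print(toy_freq)
```

Neighborhoods with similar restaurant mixes (A with B, C with D) should end up sharing a label. The label numbers themselves carry no meaning.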

## Analysis <a name="analysis"></a>

### Import the coordinate data of Pennsylvania

In [87]:
import pandas as pd
penn_data = pd.read_csv('us-zip-code-latitude-and-longitude.csv')
penn_data.head()

Unnamed: 0,Zip,City,State,Latitude,Longitude,Timezone,Daylight savings time flag,Latitude.1,Longitude.1
0,17932,Frackville,PA,40.649109,-76.503339,-5,1,40.649109,-76.503339
1,18947,Pipersville,PA,40.426391,-75.11842,-5,1,40.426391,-75.11842
2,15278,Pittsburgh,PA,40.434436,-80.024817,-5,1,40.434436,-80.024817
3,15482,Star Junction,PA,40.062849,-79.76338,-5,1,40.062849,-79.76338
4,15227,Pittsburgh,PA,40.377869,-79.97516,-5,1,40.377869,-79.97516


### Drop the irrelevant columns

In [88]:
penn_data.drop(['State', 'Timezone', 'Daylight savings time flag', 'Latitude.1', 'Longitude.1'], 
               axis=1, inplace=True)
penn_data.head()

Unnamed: 0,Zip,City,Latitude,Longitude
0,17932,Frackville,40.649109,-76.503339
1,18947,Pipersville,40.426391,-75.11842
2,15278,Pittsburgh,40.434436,-80.024817
3,15482,Star Junction,40.062849,-79.76338
4,15227,Pittsburgh,40.377869,-79.97516


### Get the zip codes of Pittsburgh

In [89]:
pitt_data = penn_data[penn_data['City'].astype(str).str.contains('Pittsburgh')].reset_index(drop=True)
pitt_data.head()

Unnamed: 0,Zip,City,Latitude,Longitude
0,15278,Pittsburgh,40.434436,-80.024817
1,15227,Pittsburgh,40.377869,-79.97516
2,15238,Pittsburgh,40.518701,-79.86744
3,15242,Pittsburgh,40.434436,-80.024817
4,15236,Pittsburgh,40.342869,-79.97929


In [90]:
pitt_data.shape

(79, 4)

In [94]:
pitt_data = pitt_data.drop_duplicates(subset = ["Latitude","Longitude"])
pitt_data.rename(columns={'Zip':'Neighborhood'}, inplace=True)
print(pitt_data.shape)
pitt_data.head()

(44, 4)


Unnamed: 0,Neighborhood,City,Latitude,Longitude
0,15278,Pittsburgh,40.434436,-80.024817
1,15227,Pittsburgh,40.377869,-79.97516
2,15238,Pittsburgh,40.518701,-79.86744
4,15236,Pittsburgh,40.342869,-79.97929
6,15239,Pittsburgh,40.482655,-79.74278
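
Two details of this filtering step are easy to miss, so here is a small sketch using a few rows copied from the outputs in this notebook. `str.contains('Pittsburgh')` is a substring match, so it also keeps cities such as "East Pittsburgh", and `drop_duplicates` keeps only the first zip code for each coordinate pair. The exact-match variant is just an alternative for comparison, not what the notebook uses.

```python
import pandas as pd

# A few rows shaped like the zip code table above.
toy = pd.DataFrame(
    {
        "Zip": [15278, 15242, 15112, 15227],
        "City": ["Pittsburgh", "Pittsburgh", "East Pittsburgh", "Pittsburgh"],
        "Latitude": [40.434436, 40.434436, 40.399436, 40.377869],
        "Longitude": [-80.024817, -80.024817, -79.83794, -79.97516],
    }
)

# Substring match: keeps "East Pittsburgh" as well as "Pittsburgh".
substring_match = toy[toy["City"].str.contains("Pittsburgh")]

# Exact match: keeps only rows whose city is exactly "Pittsburgh".
exact_match = toy[toy["City"] == "Pittsburgh"]

# Zip codes that share one coordinate pair collapse to the first occurrence.
deduped = substring_match.drop_duplicates(subset=["Latitude", "Longitude"])

print(len(substring_match), len(exact_match), len(deduped))
print(deduped["Zip"].tolist())
```

Here 15242 disappears because it has the same coordinates as 15278, which is the same effect that shrinks the real table from 79 to 44 rows.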


### Use the geopy library to get the latitude and longitude of Pittsburgh

In [14]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

!conda install -c conda-forge geopy --yes # installs geopy; comment this line out if it is already installed
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes # installs folium; comment this line out if it is already installed
import folium # map rendering library

print('Libraries imported.')

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.

Libraries imported.


In [15]:
address = 'Pittsburgh, PA'

geolocator = Nominatim(user_agent="pitt_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Pittsburgh are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Pittsburgh are 40.4416941, -79.9900861.


### Create a map of Pittsburgh with the neighborhoods

In [95]:
# create map of Pittsburgh using latitude and longitude values
map_pitt = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, zip_code in zip(pitt_data['Latitude'], pitt_data['Longitude'], pitt_data['Neighborhood']):
    label = '{}'.format(zip_code)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_pitt)  
    
map_pitt

### Define Foursquare Credentials and Version

In [30]:
CLIENT_ID = 'KTVSBSRPWUBXQDVILIXX1WHKEPNFT2JQ42HJOTBZO0K4TCKU' # your Foursquare ID
CLIENT_SECRET = '5P4LQMAS2IZU4CGYV31LHDXGT1X1JHTYP0ARBBP0KVK5BUFG' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: KTVSBSRPWUBXQDVILIXX1WHKEPNFT2JQ42HJOTBZO0K4TCKU
CLIENT_SECRET:5P4LQMAS2IZU4CGYV31LHDXGT1X1JHTYP0ARBBP0KVK5BUFG
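
Since a notebook like this is often shared, it may be safer to keep the Foursquare ID and secret out of the code and out of the printed output. The sketch below is one possible approach, not part of the original analysis: the environment variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helper names are hypothetical.

```python
import os


def load_foursquare_credentials(env=os.environ):
    """Read the Foursquare ID and secret from environment variables.

    The variable names FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET
    are hypothetical; use whatever names fit your setup.
    """
    try:
        return env["FOURSQUARE_CLIENT_ID"], env["FOURSQUARE_CLIENT_SECRET"]
    except KeyError as missing:
        raise RuntimeError(f"Missing environment variable: {missing}") from None


def mask(secret, visible=4):
    """Show only the first few characters when printing a secret."""
    return secret[:visible] + "*" * max(len(secret) - visible, 0)


# Example with a stand-in mapping instead of the real environment.
demo_env = {
    "FOURSQUARE_CLIENT_ID": "demo-id-123",
    "FOURSQUARE_CLIENT_SECRET": "demo-secret-456",
}
client_id, client_secret = load_foursquare_credentials(demo_env)
print("CLIENT_ID:", mask(client_id))
print("CLIENT_SECRET:", mask(client_secret))
```

Calling `load_foursquare_credentials()` without arguments reads from the real environment.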


### Create a function to fetch nearby venues for all neighborhoods

In [31]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
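
The function above indexes straight into the response, so a neighborhood with no `groups` or a venue without categories would raise an error. Below is a hedged sketch of a more defensive parsing step. `extract_venue_rows` is a hypothetical helper, and `fake_payload` is a hand-written stand-in that only mimics the nesting `getNearbyVenues` relies on; it is not real Foursquare output.

```python
def extract_venue_rows(name, lat, lng, payload):
    """Turn one explore-style response into flat rows, skipping gaps.

    `payload` is assumed to have the same nesting that getNearbyVenues
    indexes into: payload["response"]["groups"][0]["items"].
    """
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []

    rows = []
    for item in items:
        venue = item.get("venue", {})
        categories = venue.get("categories") or []
        location = venue.get("location", {})
        rows.append((
            name,
            lat,
            lng,
            venue.get("name"),
            location.get("lat"),
            location.get("lng"),
            # Venues without a category get None instead of raising IndexError.
            categories[0]["name"] if categories else None,
        ))
    return rows


# Hand-written stand-in for an API response (not real Foursquare output).
fake_payload = {
    "response": {
        "groups": [{
            "items": [
                {"venue": {"name": "Cafe A",
                           "location": {"lat": 40.1, "lng": -80.1},
                           "categories": [{"name": "Café"}]}},
                {"venue": {"name": "Mystery Spot",
                           "location": {"lat": 40.2, "lng": -80.2},
                           "categories": []}},
            ]
        }]
    }
}

rows = extract_venue_rows(15278, 40.434436, -80.024817, fake_payload)
empty = extract_venue_rows(15278, 40.434436, -80.024817, {"response": {}})
print(rows)
print(empty)
```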

In [96]:
LIMIT = 100 # limit of number of venues returned by Foursquare API
pitt_venues = getNearbyVenues(names=pitt_data['Neighborhood'],
                                   latitudes=pitt_data['Latitude'],
                                   longitudes=pitt_data['Longitude']
                                  )

15278
15227
15238
15236
15239
15112
15201
15275
15216
15246
15228
15225
15237
15212
15243
15205
15203
15234
15218
15229
15206
15204
15220
15222
15213
15209
15210
15224
15241
15202
15208
15215
15219
15226
15223
15298
15233
15211
15221
15214
15217
15232
15235
15207


In [97]:
print(pitt_venues.shape)
pitt_venues.head()

(636, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,15278,40.434436,-80.024817,Abandoned Blue Car,40.433925,-80.029791,Boat or Ferry
1,15278,40.434436,-80.024817,The bus,40.430508,-80.024321,Moving Target
2,15278,40.434436,-80.024817,Fort Pitt Tunnel,40.43457,-80.019372,Tunnel
3,15278,40.434436,-80.024817,South Hills Republican Club,40.437701,-80.021399,Bar
4,15227,40.377869,-79.97516,Brentwood Park,40.374912,-79.971486,Park


In [98]:
pitt_venues.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
15112,10,10,10,10,10,10
15201,49,49,49,49,49,49
15202,7,7,7,7,7,7
15203,45,45,45,45,45,45
15204,1,1,1,1,1,1
15205,15,15,15,15,15,15
15206,5,5,5,5,5,5
15207,4,4,4,4,4,4
15208,6,6,6,6,6,6
15209,6,6,6,6,6,6


In [99]:
print('There are {} unique categories.'.format(len(pitt_venues['Venue Category'].unique())))

There are 192 unique categories.


### Analyze each neighborhood with one-hot encoding of the categories

In [100]:
# one hot encoding
pitt_onehot = pd.get_dummies(pitt_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
pitt_onehot['Neighborhood'] = pitt_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [pitt_onehot.columns[-1]] + list(pitt_onehot.columns[:-1])
pitt_onehot = pitt_onehot[fixed_columns]

pitt_onehot.head()

Unnamed: 0,Neighborhood,Adult Boutique,American Restaurant,Arcade,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Auto Garage,Auto Workshop,BBQ Joint,Bagel Shop,Bakery,Bank,Bar,Baseball Field,Basketball Court,Beer Garden,Beer Store,Boat or Ferry,Bookstore,Boutique,Bowling Alley,Breakfast Spot,Bubble Tea Shop,Burger Joint,Bus Station,Bus Stop,Business Service,Cafeteria,Café,Campground,Candy Store,Chinese Restaurant,Climbing Gym,Clothing Store,Cocktail Bar,Coffee Shop,College Academic Building,College Arts Building,Comic Shop,Concert Hall,Construction & Landscaping,Convenience Store,Cosmetics Shop,Cuban Restaurant,Cycle Studio,Dance Studio,Deli / Bodega,Department Store,Dessert Shop,Diner,Discount Store,Dive Bar,Dog Run,Donut Shop,Dumpling Restaurant,Electronics Store,Event Space,Exhibit,Farmers Market,Fast Food Restaurant,Fish & Chips Shop,Food & Drink Shop,Football Stadium,French Restaurant,Fried Chicken Joint,Frozen Yogurt Shop,Furniture / Home Store,Garden Center,Gas Station,Gastropub,Gay Bar,General Travel,Gift Shop,Golf Course,Gourmet Shop,Greek Restaurant,Grocery Store,Gym,Gym / Fitness Center,Gym Pool,Harbor / Marina,Hardware Store,Health & Beauty Service,High School,Historic Site,History Museum,Home Service,Hotel,Ice Cream Shop,Indian Restaurant,Indie Movie Theater,Intersection,Italian Restaurant,Japanese Restaurant,Jewelry Store,Juice Bar,Karaoke Bar,Kids Store,Korean Restaurant,Laundromat,Library,Light Rail Station,Lingerie Store,Liquor Store,Lounge,Marijuana Dispensary,Martial Arts Dojo,Mattress Store,Mediterranean Restaurant,Men's Store,Mexican Restaurant,Middle Eastern Restaurant,Miscellaneous Shop,Mobile Phone Shop,Motel,Movie Theater,Moving Target,Museum,Music Store,Music Venue,New American Restaurant,Nightclub,Noodle House,Optical Shop,Other Event,Other Repair Shop,Paper / Office Supplies Store,Park,Pet Store,Pharmacy,Pizza Place,Playground,Plaza,Pool,Pub,Public Art,Rafting,Ramen Restaurant,Record Shop,Rental Car Location,Rental Service,Residential Building (Apartment / Condo),Restaurant,Rock Club,Russian Restaurant,Salon / Barbershop,Sandwich Place,School,Sculpture Garden,Seafood Restaurant,Shipping Store,Shoe Store,Shop & Service,Shopping Mall,Shopping Plaza,Smoke Shop,Snack Place,Spa,Spanish Restaurant,Speakeasy,Sporting Goods Shop,Sports Bar,Stadium,Steakhouse,Storage Facility,Supermarket,Supplement Shop,Sushi Restaurant,Szechuan Restaurant,Taco Place,Tattoo Parlor,Tea Room,Tech Startup,Thai Restaurant,Theater,Theme Park,Thrift / Vintage Store,Toy / Game Store,Trail,Tunnel,Turkish Restaurant,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Water Park,Wings Joint,Women's Store,Yoga Studio
0,15278,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,15278,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,15278,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
3,15278,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,15227,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [101]:
pitt_onehot.shape

(636, 193)

In [102]:
# keeping only the columns associated with restaurants
restaurant_cols = [col for col in pitt_onehot.columns if ('Neighborhood') in col] + [col for col in pitt_onehot.columns if ('Restaurant') in col]
pitt_onehot = pitt_onehot[restaurant_cols]
print(pitt_onehot.shape)
pitt_onehot.head()

(636, 28)


Unnamed: 0,Neighborhood,American Restaurant,Asian Restaurant,Chinese Restaurant,Cuban Restaurant,Dumpling Restaurant,Fast Food Restaurant,French Restaurant,Greek Restaurant,Indian Restaurant,Italian Restaurant,Japanese Restaurant,Korean Restaurant,Mediterranean Restaurant,Mexican Restaurant,Middle Eastern Restaurant,New American Restaurant,Ramen Restaurant,Restaurant,Russian Restaurant,Seafood Restaurant,Spanish Restaurant,Sushi Restaurant,Szechuan Restaurant,Thai Restaurant,Turkish Restaurant,Vegetarian / Vegan Restaurant,Vietnamese Restaurant
0,15278,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,15278,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,15278,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,15278,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,15227,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
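
Note that the filter above keeps only categories whose name contains the word "Restaurant". Food venues named differently in the category list, such as "Pizza Place" or "Burger Joint", are left out. The sketch below compares the notebook's rule with a broader keyword rule on a handful of category names from the one-hot columns. The keyword list `FOOD_KEYWORDS` is my own assumption and is not part of the original analysis.

```python
# A few category names taken from the one-hot column list above.
categories = [
    "American Restaurant", "Bar", "Burger Joint", "Italian Restaurant",
    "Park", "Pizza Place", "Steakhouse", "Taco Place",
]

# The rule used in the notebook: keep names containing "Restaurant".
by_name = [c for c in categories if "Restaurant" in c]

# A broader, hand-picked keyword list (an assumption for illustration).
FOOD_KEYWORDS = ("Restaurant", "Joint", "Pizza", "Steakhouse", "Taco", "Diner", "Noodle")
by_keywords = [c for c in categories if any(k in c for k in FOOD_KEYWORDS)]

print(by_name)
print(by_keywords)
```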


In [103]:
pitt_grouped = pitt_onehot.groupby('Neighborhood').mean().reset_index()
print(pitt_grouped.shape)
pitt_grouped

(42, 28)


Unnamed: 0,Neighborhood,American Restaurant,Asian Restaurant,Chinese Restaurant,Cuban Restaurant,Dumpling Restaurant,Fast Food Restaurant,French Restaurant,Greek Restaurant,Indian Restaurant,Italian Restaurant,Japanese Restaurant,Korean Restaurant,Mediterranean Restaurant,Mexican Restaurant,Middle Eastern Restaurant,New American Restaurant,Ramen Restaurant,Restaurant,Russian Restaurant,Seafood Restaurant,Spanish Restaurant,Sushi Restaurant,Szechuan Restaurant,Thai Restaurant,Turkish Restaurant,Vegetarian / Vegan Restaurant,Vietnamese Restaurant
0,15112,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,15201,0.020408,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.020408,0.0,0.020408,0.020408,0.0,0.020408,0.0,0.0,0.0,0.0,0.0,0.0,0.020408
2,15202,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.142857,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,15203,0.066667,0.044444,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.022222,0.0,0.0,0.0,0.044444,0.0,0.0,0.0,0.022222,0.0,0.0,0.022222,0.044444,0.0,0.044444,0.0,0.0,0.0
4,15204,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,15205,0.0,0.0,0.066667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,15206,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,15207,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,15208,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,15209,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
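
The values in this table are frequencies: the mean of a 0/1 indicator column is the number of venues of that type divided by all venues returned for the zip code. As a worked example, zip code 15202 has 7 venues and an Italian frequency of 0.142857, which is 1/7. The sketch below reproduces that number on toy data. Only the count of 7 and the single Italian venue mirror the outputs above; the other category names are made up.

```python
import pandas as pd

# Toy venue list: zip code 15202 with 7 venues, one of them Italian
# (the other six categories are placeholders).
toy_venues = pd.DataFrame(
    {
        "Neighborhood": [15202] * 7,
        "Venue Category": [
            "Italian Restaurant", "Bar", "Park", "Pharmacy",
            "Bank", "Gym", "Pizza Place",
        ],
    }
)

# One indicator column per category, then the per-neighborhood mean,
# which equals (venues in category) / (all venues in the neighborhood).
onehot = pd.get_dummies(toy_venues["Venue Category"], dtype=int)
onehot.insert(0, "Neighborhood", toy_venues["Neighborhood"])
freq = onehot.groupby("Neighborhood").mean()

print(round(freq.loc[15202, "Italian Restaurant"], 6))
```

Because the denominator counts every venue and not only restaurants, a zip code with few restaurants among many other venues ends up with small frequencies in every restaurant column.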


### Define a function to sort the restaurants in descending order.

In [54]:
def return_most_common_restaurant(row, num_top_restaurant):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_restaurant]
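
One caveat about this function: when several categories have the same frequency, and in particular when every frequency is 0.0, the order returned depends only on how the sort breaks ties. In the grouped table, for example, zip code 15112 has 0.0 in every column, yet the sorted result still lists a "1st Most Common Restaurant" for it. The following sketch is a possible variant that ignores zero frequencies. `top_restaurants_nonzero` and its `placeholder` argument are hypothetical names, not part of the original notebook.

```python
import pandas as pd


def top_restaurants_nonzero(row, num_top_restaurant, placeholder="(none)"):
    """Like return_most_common_restaurant, but ignores zero frequencies.

    `placeholder` is a hypothetical filler used when a neighborhood has
    fewer non-zero categories than requested.
    """
    freqs = row.iloc[1:].astype(float)  # skip the Neighborhood entry
    nonzero = freqs[freqs > 0]
    # mergesort is stable, so equal frequencies keep their column order.
    ranked = nonzero.sort_values(ascending=False, kind="mergesort")
    names = list(ranked.index[:num_top_restaurant])
    return names + [placeholder] * (num_top_restaurant - len(names))


# Toy rows: one neighborhood without restaurants, one with a single type.
toy_row_empty = pd.Series(
    {"Neighborhood": 15112, "American Restaurant": 0.0, "Italian Restaurant": 0.0}
)
toy_row_mixed = pd.Series(
    {"Neighborhood": 15202, "American Restaurant": 0.0, "Italian Restaurant": 0.142857}
)

print(top_restaurants_nonzero(toy_row_empty, 2))
print(top_restaurants_nonzero(toy_row_mixed, 2))
```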

### Create the new dataframe and display the top 5 types of restaurants for each neighborhood.

In [104]:
num_top_restaurant = 5

indicators = ['st', 'nd', 'rd']

# create columns according to number of top restaurants
columns = ['Neighborhood']
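for ind in np.arange(num_top_restaurant):
    try:
        # 1st, 2nd, 3rd use the suffixes listed in indicators
        columns.append('{}{} Most Common Restaurant'.format(ind+1, indicators[ind]))
    except IndexError:
        # everything from the 4th onward uses 'th'
        columns.append('{}th Most Common Restaurant'.format(ind+1))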

# create a new dataframe
neighborhoods_restaurant_sorted = pd.DataFrame(columns=columns)
neighborhoods_restaurant_sorted['Neighborhood'] = pitt_grouped['Neighborhood']

for ind in np.arange(pitt_grouped.shape[0]):
    neighborhoods_restaurant_sorted.iloc[ind, 1:] = return_most_common_restaurant(pitt_grouped.iloc[ind, :], num_top_restaurant)

neighborhoods_restaurant_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Restaurant,2nd Most Common Restaurant,3rd Most Common Restaurant,4th Most Common Restaurant,5th Most Common Restaurant
0,15112,Vietnamese Restaurant,Mediterranean Restaurant,Asian Restaurant,Chinese Restaurant,Cuban Restaurant
1,15201,Vietnamese Restaurant,Seafood Restaurant,Middle Eastern Restaurant,Ramen Restaurant,Restaurant
2,15202,Italian Restaurant,Vietnamese Restaurant,Mediterranean Restaurant,Asian Restaurant,Chinese Restaurant
3,15203,American Restaurant,Sushi Restaurant,Asian Restaurant,Mexican Restaurant,Thai Restaurant
4,15204,Vietnamese Restaurant,Mediterranean Restaurant,Asian Restaurant,Chinese Restaurant,Cuban Restaurant


In [105]:
neighborhoods_restaurant_sorted

Unnamed: 0,Neighborhood,1st Most Common Restaurant,2nd Most Common Restaurant,3rd Most Common Restaurant,4th Most Common Restaurant,5th Most Common Restaurant
0,15112,Vietnamese Restaurant,Mediterranean Restaurant,Asian Restaurant,Chinese Restaurant,Cuban Restaurant
1,15201,Vietnamese Restaurant,Seafood Restaurant,Middle Eastern Restaurant,Ramen Restaurant,Restaurant
2,15202,Italian Restaurant,Vietnamese Restaurant,Mediterranean Restaurant,Asian Restaurant,Chinese Restaurant
3,15203,American Restaurant,Sushi Restaurant,Asian Restaurant,Mexican Restaurant,Thai Restaurant
4,15204,Vietnamese Restaurant,Mediterranean Restaurant,Asian Restaurant,Chinese Restaurant,Cuban Restaurant
5,15205,Chinese Restaurant,Vietnamese Restaurant,Mediterranean Restaurant,Asian Restaurant,Cuban Restaurant
6,15206,Vietnamese Restaurant,Mediterranean Restaurant,Asian Restaurant,Chinese Restaurant,Cuban Restaurant
7,15207,Seafood Restaurant,Vietnamese Restaurant,Mediterranean Restaurant,Asian Restaurant,Chinese Restaurant
8,15208,Restaurant,Vietnamese Restaurant,Mediterranean Restaurant,Asian Restaurant,Chinese Restaurant
9,15209,Fast Food Restaurant,Vietnamese Restaurant,Mediterranean Restaurant,Asian Restaurant,Chinese Restaurant


### Cluster neighborhoods

In [106]:
# set number of clusters
kclusters = 5

pitt_grouped_clustering = pitt_grouped.drop('Neighborhood', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(pitt_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_ 

array([3, 3, 4, 0, 3, 3, 3, 2, 0, 0, 1, 3, 3, 3, 3, 4, 3, 3, 0, 3, 3, 3,
       0, 1, 3, 3, 4, 3, 3, 0, 3, 3, 1, 3, 3, 3, 3, 3, 3, 3, 0, 3],
      dtype=int32)
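
Before interpreting the clusters, it helps to see how many neighborhoods fall under each label. This small sketch counts them with `np.bincount`, using the label array copied from the output above.

```python
import numpy as np

# Cluster labels copied from the k-means output above.
labels = np.array([3, 3, 4, 0, 3, 3, 3, 2, 0, 0, 1, 3, 3, 3, 3, 4, 3, 3, 0, 3, 3, 3,
                   0, 1, 3, 3, 4, 3, 3, 0, 3, 3, 1, 3, 3, 3, 3, 3, 3, 3, 0, 3])

# Number of neighborhoods assigned to each label 0..4.
sizes = np.bincount(labels, minlength=5)
for label, size in enumerate(sizes):
    print(f"label {label}: {size} neighborhoods")
```

Label 3 holds 28 of the 42 neighborhoods, while label 2 holds a single one, so the clusters are far from evenly sized.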

### Create a new dataframe that includes the cluster label and the top 5 restaurant types for each neighborhood.

In [107]:
# add clustering labels
neighborhoods_restaurant_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

pitt_merged = pitt_data

# merge the sorted restaurant table into pitt_data to add latitude/longitude for each neighborhood
pitt_merged = pitt_merged.join(neighborhoods_restaurant_sorted.set_index('Neighborhood'), on='Neighborhood')
#pitt_merged["Cluster Labels"].astype(int)

pitt_merged.head() # check the last columns!

Unnamed: 0,Neighborhood,City,Latitude,Longitude,Cluster Labels,1st Most Common Restaurant,2nd Most Common Restaurant,3rd Most Common Restaurant,4th Most Common Restaurant,5th Most Common Restaurant
0,15278,Pittsburgh,40.434436,-80.024817,3.0,Vietnamese Restaurant,Mediterranean Restaurant,Asian Restaurant,Chinese Restaurant,Cuban Restaurant
1,15227,Pittsburgh,40.377869,-79.97516,3.0,Vietnamese Restaurant,Mediterranean Restaurant,Asian Restaurant,Chinese Restaurant,Cuban Restaurant
2,15238,Pittsburgh,40.518701,-79.86744,3.0,Vietnamese Restaurant,Mediterranean Restaurant,Asian Restaurant,Chinese Restaurant,Cuban Restaurant
4,15236,Pittsburgh,40.342869,-79.97929,3.0,Vietnamese Restaurant,Mediterranean Restaurant,Asian Restaurant,Chinese Restaurant,Cuban Restaurant
6,15239,Pittsburgh,40.482655,-79.74278,3.0,Vietnamese Restaurant,Mediterranean Restaurant,Asian Restaurant,Chinese Restaurant,Cuban Restaurant
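
The cluster labels show up as `3.0` rather than `3` because the join is a left join: `pitt_data` has 44 zip codes but only 42 of them have venues, so the unmatched rows get `NaN`, and a column containing `NaN` becomes float. The sketch below shows the effect on toy data and two possible ways to handle it. The three-row tables are stand-ins, not the real data.

```python
import pandas as pd

# Toy stand-ins: three zip codes, but cluster labels for only two of them.
zips = pd.DataFrame({"Neighborhood": [15278, 15227, 15246]})
labels = pd.DataFrame({"Neighborhood": [15278, 15227], "Cluster Labels": [3, 3]})

# Left join on the zip code; the unmatched row gets NaN.
merged = zips.join(labels.set_index("Neighborhood"), on="Neighborhood")
print(merged["Cluster Labels"].dtype)  # NaN forces a float column

# Option 1: drop unmatched rows, then restore integer labels.
clean = merged.dropna(subset=["Cluster Labels"]).copy()
clean["Cluster Labels"] = clean["Cluster Labels"].astype(int)

# Option 2: keep every row and use pandas' nullable integer dtype.
nullable = merged["Cluster Labels"].astype("Int64")

print(clean)
print(nullable)
```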


In [116]:
pitt_merged_1 = pitt_merged.dropna().copy()  # copy so the assignment below does not trigger SettingWithCopyWarning
pitt_merged_1['Cluster Labels'] = pitt_merged_1['Cluster Labels'].astype(int)
pitt_merged_1.reset_index(drop=True, inplace=True)
pitt_merged_1


Unnamed: 0,Neighborhood,City,Latitude,Longitude,Cluster Labels,1st Most Common Restaurant,2nd Most Common Restaurant,3rd Most Common Restaurant,4th Most Common Restaurant,5th Most Common Restaurant
0,15278,Pittsburgh,40.434436,-80.024817,3,Vietnamese Restaurant,Mediterranean Restaurant,Asian Restaurant,Chinese Restaurant,Cuban Restaurant
1,15227,Pittsburgh,40.377869,-79.97516,3,Vietnamese Restaurant,Mediterranean Restaurant,Asian Restaurant,Chinese Restaurant,Cuban Restaurant
2,15238,Pittsburgh,40.518701,-79.86744,3,Vietnamese Restaurant,Mediterranean Restaurant,Asian Restaurant,Chinese Restaurant,Cuban Restaurant
3,15236,Pittsburgh,40.342869,-79.97929,3,Vietnamese Restaurant,Mediterranean Restaurant,Asian Restaurant,Chinese Restaurant,Cuban Restaurant
4,15239,Pittsburgh,40.482655,-79.74278,3,Vietnamese Restaurant,Mediterranean Restaurant,Asian Restaurant,Chinese Restaurant,Cuban Restaurant
5,15112,East Pittsburgh,40.399436,-79.83794,3,Vietnamese Restaurant,Mediterranean Restaurant,Asian Restaurant,Chinese Restaurant,Cuban Restaurant
6,15201,Pittsburgh,40.471468,-79.95726,3,Vietnamese Restaurant,Seafood Restaurant,Middle Eastern Restaurant,Ramen Restaurant,Restaurant
7,15275,Pittsburgh,40.44952,-80.179475,0,Fast Food Restaurant,Mexican Restaurant,Italian Restaurant,Restaurant,Asian Restaurant
8,15216,Pittsburgh,40.400319,-80.03566,3,Vietnamese Restaurant,Mediterranean Restaurant,Asian Restaurant,Chinese Restaurant,Cuban Restaurant
9,15228,Pittsburgh,40.372802,-80.0448,3,Vietnamese Restaurant,Mediterranean Restaurant,Asian Restaurant,Chinese Restaurant,Cuban Restaurant


### Visualize the clusters on a map

In [117]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(pitt_merged_1['Latitude'], pitt_merged_1['Longitude'], pitt_merged_1['Neighborhood'], pitt_merged_1['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
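
A small note on the color lookup: the map uses `rainbow[cluster-1]`, so label 0 wraps around to the last color in the list. Every label still gets its own color, but the mapping is shifted by one. The sketch below illustrates this with five stand-in color names in place of the hex colors the notebook builds from matplotlib's rainbow colormap.

```python
# Five stand-in color names, one per cluster label 0..4.
palette = ["purple", "blue", "green", "orange", "red"]

for cluster in range(5):
    # Indexing used in the notebook: label 0 wraps around to the last color.
    shifted = palette[cluster - 1]
    # Direct indexing: label n gets the n-th color.
    direct = palette[cluster]
    print(cluster, shifted, direct)
```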

## Results and Discussion <a name="results"></a>

### Examine each cluster

#### Cluster 1

In [118]:
pitt_merged.loc[pitt_merged['Cluster Labels'] == 0, pitt_merged.columns[[0] + list(range(5, pitt_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Restaurant,2nd Most Common Restaurant,3rd Most Common Restaurant,4th Most Common Restaurant,5th Most Common Restaurant
12,15275,Fast Food Restaurant,Mexican Restaurant,Italian Restaurant,Restaurant,Asian Restaurant
35,15203,American Restaurant,Sushi Restaurant,Asian Restaurant,Mexican Restaurant,Thai Restaurant
37,15218,American Restaurant,Fast Food Restaurant,Chinese Restaurant,Seafood Restaurant,Mediterranean Restaurant
38,15229,Chinese Restaurant,Fast Food Restaurant,Italian Restaurant,Asian Restaurant,Vietnamese Restaurant
46,15222,American Restaurant,Italian Restaurant,Restaurant,Indian Restaurant,Korean Restaurant
49,15209,Fast Food Restaurant,Vietnamese Restaurant,Mediterranean Restaurant,Asian Restaurant,Chinese Restaurant
55,15208,Restaurant,Vietnamese Restaurant,Mediterranean Restaurant,Asian Restaurant,Chinese Restaurant


#### As the table above shows, the most common restaurants in this cluster are fast food and American restaurants. If you would like to open one of these, you might want to choose a different area to avoid fierce competition.

#### Cluster 2

In [119]:
pitt_merged.loc[pitt_merged['Cluster Labels'] == 1, pitt_merged.columns[[0] + list(range(5, pitt_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Restaurant,2nd Most Common Restaurant,3rd Most Common Restaurant,4th Most Common Restaurant,5th Most Common Restaurant
36,15234,American Restaurant,Mediterranean Restaurant,Asian Restaurant,Chinese Restaurant,Cuban Restaurant
50,15210,American Restaurant,Fast Food Restaurant,Mediterranean Restaurant,Asian Restaurant,Chinese Restaurant
61,15223,American Restaurant,Mediterranean Restaurant,Asian Restaurant,Chinese Restaurant,Cuban Restaurant


#### In this cluster, the most common types are American and Mediterranean restaurants, so avoid opening one of these in these neighborhoods.

#### Cluster 3

In [120]:
pitt_merged.loc[pitt_merged['Cluster Labels'] == 2, pitt_merged.columns[[0] + list(range(5, pitt_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Restaurant,2nd Most Common Restaurant,3rd Most Common Restaurant,4th Most Common Restaurant,5th Most Common Restaurant
78,15207,Seafood Restaurant,Vietnamese Restaurant,Mediterranean Restaurant,Asian Restaurant,Chinese Restaurant


#### This cluster contains only one neighborhood, so there is not enough information to decide whether a specific kind of restaurant is a good idea, but definitely don't try a seafood restaurant here.

#### Cluster 4

In [122]:
pitt_merged.loc[pitt_merged['Cluster Labels'] == 3, pitt_merged.columns[[0] + list(range(5, pitt_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Restaurant,2nd Most Common Restaurant,3rd Most Common Restaurant,4th Most Common Restaurant,5th Most Common Restaurant
0,15278,Vietnamese Restaurant,Mediterranean Restaurant,Asian Restaurant,Chinese Restaurant,Cuban Restaurant
1,15227,Vietnamese Restaurant,Mediterranean Restaurant,Asian Restaurant,Chinese Restaurant,Cuban Restaurant
2,15238,Vietnamese Restaurant,Mediterranean Restaurant,Asian Restaurant,Chinese Restaurant,Cuban Restaurant
4,15236,Vietnamese Restaurant,Mediterranean Restaurant,Asian Restaurant,Chinese Restaurant,Cuban Restaurant
6,15239,Vietnamese Restaurant,Mediterranean Restaurant,Asian Restaurant,Chinese Restaurant,Cuban Restaurant
9,15112,Vietnamese Restaurant,Mediterranean Restaurant,Asian Restaurant,Chinese Restaurant,Cuban Restaurant
10,15201,Vietnamese Restaurant,Seafood Restaurant,Middle Eastern Restaurant,Ramen Restaurant,Restaurant
16,15216,Vietnamese Restaurant,Mediterranean Restaurant,Asian Restaurant,Chinese Restaurant,Cuban Restaurant
22,15228,Vietnamese Restaurant,Mediterranean Restaurant,Asian Restaurant,Chinese Restaurant,Cuban Restaurant
25,15225,Vietnamese Restaurant,Mediterranean Restaurant,Asian Restaurant,Chinese Restaurant,Cuban Restaurant


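#### This cluster contains the largest number of neighborhoods, and they follow basically the same pattern. The most common types are listed as Vietnamese, Mediterranean, and Asian, in that order. So an Asian restaurant or, more specifically, a Chinese restaurant might be a decent choice, since there is less competition than for the first two types. Keep in mind that several zip codes in this cluster (for example 15112 and 15204) have a frequency of 0.0 for every restaurant type in `pitt_grouped`, so for them the ordering reflects how ties are sorted rather than actual popularity.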

#### Cluster 5

In [123]:
pitt_merged.loc[pitt_merged['Cluster Labels'] == 4, pitt_merged.columns[[0] + list(range(5, pitt_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Restaurant,2nd Most Common Restaurant,3rd Most Common Restaurant,4th Most Common Restaurant,5th Most Common Restaurant
54,15202,Italian Restaurant,Vietnamese Restaurant,Mediterranean Restaurant,Asian Restaurant,Chinese Restaurant
56,15215,Italian Restaurant,Vietnamese Restaurant,Mediterranean Restaurant,Asian Restaurant,Chinese Restaurant
60,15226,Italian Restaurant,Mediterranean Restaurant,Vietnamese Restaurant,Asian Restaurant,Chinese Restaurant


#### Finally, there is one cluster dominated by Italian restaurants. If you'd like to open a new restaurant in these neighborhoods, you should avoid Italian.
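
The per-cluster observations above can also be tallied in code instead of read off the tables by eye. The sketch below counts how often each restaurant type appears as the top choice within a cluster, using the rows for labels 0, 1, and 4 copied from the tables above. It is one possible way to summarize the results, not a step from the original analysis.

```python
import pandas as pd

# Rows copied from the cluster tables above (labels 0, 1 and 4 only).
top_choice = pd.DataFrame(
    {
        "Cluster Labels": [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 4, 4, 4],
        "1st Most Common Restaurant": [
            "Fast Food Restaurant", "American Restaurant", "American Restaurant",
            "Chinese Restaurant", "American Restaurant", "Fast Food Restaurant",
            "Restaurant",
            "American Restaurant", "American Restaurant", "American Restaurant",
            "Italian Restaurant", "Italian Restaurant", "Italian Restaurant",
        ],
    }
)

# Count how often each type is the top choice within each cluster.
tally = (
    top_choice.groupby("Cluster Labels")["1st Most Common Restaurant"]
    .value_counts()
)
print(tally)
```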

## Conclusion <a name="conclusion"></a>

In this project, the neighborhoods of Pittsburgh are grouped into clusters based on the most common types of restaurants in them. Anyone who wants to open a new restaurant should avoid the most common types and aim for less common ones, for example the 4th or 5th most common type. For consumers, the clustering is also useful when looking for a specific type of restaurant: it shows where they have the best chance of finding a good one. Thanks for reading.

![alt text][Pitt]

[Pitt]: https://9b16f79ca967fd0708d1-2713572fef44aa49ec323e813b06d2d9.ssl.cf2.rackcdn.com/1140x_a10-7_cTC/20171016dsSkylineLightsLocal-1-1569164618.jpg