# Bloaty's Pizza Hog

## The problem:
Bloaty's Pizza Hog is a pizza place in Pittsburgh's South Side Flats. With their established success, the owners are hoping to open new restaurants in the Pittsburgh area. They want to scope out Pittsburgh's neighborhoods to find which ones are most similar to the South Side Flats, so that their new locations can mirror their current success.

## The proposal:

A model that seems adequate for this is k-means clustering. We'll gather information on the various neighborhoods, then group them into clusters. We should then see at least a couple of other neighborhoods that are similar to the South Side Flats, which may be good candidates for new restaurant locations.

## The data:

### List of neighborhoods
The first thing we'll need is a list of the neighborhoods in Pittsburgh. Wikipedia has just such a source for us! Let's take a look at the page:

### The wiki data
We'll be using Wikipedia's list of Pittsburgh's neighborhoods to get our data.

In [144]:
from IPython.display import IFrame
IFrame("https://en.wikipedia.org/wiki/List_of_Pittsburgh_neighborhoods", width = 800, height=450)

In [145]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd
import re
import requests
import numpy as np
from sklearn.cluster import KMeans

Let's start by getting our initial list of neighborhoods and the links to their individual pages, where we can get their coordinates.

In [146]:
# wiki page with list of Pittsburgh neighborhoods
wiki = "https://en.wikipedia.org/wiki/List_of_Pittsburgh_neighborhoods"

# open the page and create soup object
page = urlopen(wiki)
soup = BeautifulSoup(page, 'html.parser')

# find our list of neighborhoods
neighborhoods_div = soup.find('div', attrs={"class": "div-col columns column-width"})

# get all the "li" tags, and collect the names and urls
rows = []
for li in neighborhoods_div.find_all('li'):
    a = li.find('a')
    text = a.text.strip()
    url = "https://en.wikipedia.org" + a.attrs.get("href")
    rows.append({"Neighborhood": text, "wiki": url})

# create the dataframe we'll store our data in
# (DataFrame.append was removed in pandas 2.0, so we build the frame from a list of rows;
# latitude and longitude start out empty and are filled in later)
df = pd.DataFrame(rows, columns=["Neighborhood", "wiki", "latitude", "longitude"])

    
df.head()

Unnamed: 0,Neighborhood,wiki,latitude,longitude
0,Allegheny Center,https://en.wikipedia.org/wiki/Allegheny_Center...,,
1,Allegheny West,https://en.wikipedia.org/wiki/Allegheny_West_(...,,
2,Allentown,https://en.wikipedia.org/wiki/Allentown_(Pitts...,,
3,Arlington,https://en.wikipedia.org/wiki/Arlington_(Pitts...,,
4,Arlington Heights,https://en.wikipedia.org/wiki/Arlington_Height...,,


The wiki pages give the coordinates in degrees/minutes/seconds. For our API, we'll need them in decimal degrees. Here's how we'll convert those coordinates to something more useful for us:

In [147]:
# method to transform lat and long coordinates from degrees to decimal format
def toDegrees(lat, long):
    # parse the coordinate
    lat = re.split("[\u2032 \u2033 \N{DEGREE SIGN} N]", lat)
    long = re.split("[\u2032 \u2033 \N{DEGREE SIGN} W]", long)
    
    # some neighborhoods only provide degrees and minutes, so the seconds conversion fails
    # and we fall back to degrees and minutes only
    try:
        # convert strings to floats
        lat = [float(x) for x in lat[0:3]]
        long = [float(x) for x in long[0:3]]
        # the math part
        lat_dec = round(lat[0] + (lat[1]/60) + (lat[2]/3600), 6)
        long_dec = -round(long[0] + (long[1]/60) + (long[2]/3600), 6)
    except (ValueError, IndexError):
        # convert strings to floats
        lat = [float(x) for x in lat[0:2]]
        long = [float(x) for x in long[0:2]]
        # the math part
        lat_dec = round(lat[0] + (lat[1]/60), 6)
        long_dec = -round(long[0] + (long[1]/60), 6)
    
    return lat_dec, long_dec
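
To make the arithmetic concrete, here's a small standalone sketch of the same conversion. The helper name `dms_to_decimal` and the two sample coordinate strings are our own illustrative choices, not values taken from the scrape. It treats missing seconds as zero instead of using a try/except.

```python
import re

def dms_to_decimal(coord):
    """Convert a string like 40°27′11″N to signed decimal degrees."""
    # pull out the numeric parts: degrees, minutes and (optionally) seconds
    parts = [float(p) for p in re.findall(r"\d+(?:\.\d+)?", coord)]
    # pad with zeros when minutes or seconds are missing
    parts += [0.0] * (3 - len(parts))
    degrees, minutes, seconds = parts[:3]
    value = degrees + minutes / 60 + seconds / 3600
    # southern and western hemispheres are negative
    if coord.strip()[-1] in "SW":
        value = -value
    return round(value, 6)

# illustrative inputs: 40 + 27/60 + 11/3600 and -(80 + 0/60 + 18/3600)
lat = dms_to_decimal("40°27′11″N")
lon = dms_to_decimal("80°00′18″W")
print(lat, lon)
```

Reading the hemisphere letter also means the helper isn't tied to Pittsburgh's N/W quadrant, although that's all we need here.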

In [148]:
for index, row in df.iterrows():
    # get the latitude and longitude from each wikipedia page
    location_page = urlopen(row["wiki"])
    location_soup = BeautifulSoup(location_page, 'html.parser')
    lat = location_soup.find("span", attrs={"class":"latitude"}).text.strip()
    long = location_soup.find("span", attrs={"class":"longitude"}).text.strip()
    
    # convert to decimal and write back to the dataframe
    # (assigning to `row` is not guaranteed to update df, so we write through df.at)
    df.at[index, "latitude"], df.at[index, "longitude"] = toDegrees(lat, long)

In [149]:
# there are some neighborhoods that have smaller neighborhoods within them. They therefore have matching latitude and longitude
# let's drop the duplicates
df.drop_duplicates(subset=["latitude", "longitude"], inplace=True)

# and we won't need the wiki links anymore, so let's drop those just to clean up our data frame
df.drop("wiki", axis=1, inplace=True)

In [151]:
df.head(20)

Unnamed: 0,Neighborhood,latitude,longitude
0,Allegheny Center,40.4531,-80.005
1,Allegheny West,40.4521,-80.0158
2,Allentown,40.4211,-79.9939
3,Arlington,40.415,-79.97
5,Banksville,40.4119,-80.0389
6,Bedford Dwellings,40.4453,-79.9797
7,Beechview,40.4136,-80.0225
8,Beltzhoover,40.4161,-80.0031
9,Bloomfield,40.4611,-79.9481
10,Bluff,40.4361,-79.9889


### The Venues Data

For each neighborhood, we'll use Foursquare's API to look up what kinds of venues it has. This should at least give us a reasonable picture of neighborhood trends.

Our Wikipedia page provides us with a list of neighborhoods, but to use Foursquare's API, we also need each neighborhood's location. That is why we used BeautifulSoup above to follow each neighborhood link on the list page and capture the latitude and longitude from the neighborhood's own page.

In [152]:
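# API credentials
# read from environment variables so the keys are not stored in the notebook
# (the variable names FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET are our own choice)
import os
client_id = os.environ["FOURSQUARE_CLIENT_ID"]
client_secret = os.environ["FOURSQUARE_CLIENT_SECRET"]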
version = "20180605"
limit = 100

Here's our method to get the venues for each neighborhood

In [153]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name) # this method takes a while to run, so printing each name shows our progress
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            client_id, 
            client_secret, 
            version, 
            lat, 
            lng, 
            radius, 
            limit)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
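
The method above assumes every venue has at least one category and that the response always contains a group. Here's a hedged sketch of the parsing step on its own, run against a hand-written payload shaped like the `response` → `groups` → `items` → `venue` structure the method reads. The helper name `extract_venues` is ours, and the second venue in the sample is made up to show the empty-category case.

```python
def extract_venues(payload, neighborhood):
    """Flatten an explore-style response into (neighborhood, name, lat, lng, category) rows."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories", [])
        # fall back to None instead of raising IndexError when a venue has no category
        category = categories[0]["name"] if categories else None
        rows.append((
            neighborhood,
            venue["name"],
            venue["location"]["lat"],
            venue["location"]["lng"],
            category,
        ))
    return rows

# hand-written sample payload (not a real API response)
sample = {"response": {"groups": [{"items": [
    {"venue": {"name": "Park House",
               "location": {"lat": 40.453284, "lng": -80.001504},
               "categories": [{"name": "Bar"}]}},
    {"venue": {"name": "Unlabeled Spot",  # hypothetical venue with no category
               "location": {"lat": 40.45, "lng": -80.0},
               "categories": []}},
]}]}}

rows = extract_venues(sample, "Allegheny Center")
for r in rows:
    print(r)
```

Keeping the parsing separate from the HTTP call makes it easy to check on a small payload before running the slow loop over every neighborhood.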

In [155]:
venues_df = getNearbyVenues(df.Neighborhood, df.latitude, df.longitude)

Allegheny Center
Allegheny West
Allentown
Arlington
Banksville
Bedford Dwellings
Beechview
Beltzhoover
Bloomfield
Bluff
Bon Air
Brighton Heights
Brookline
California-Kirkbride
Carrick
Central Business District
Chinatown
Cultural District
Central Lawrenceville
Central Northside
Mexican War Streets
Central Oakland
Chartiers
Chateau
Crafton Heights
Duquesne Heights
East Allegheny
East Carnegie
East Hills
East Liberty
Elliott
Esplen
Fairywood
Fineview
Friendship
Garfield
Glen Hazel
Greenfield
Four Mile Run
Hays
Hazelwood
Highland Park
Homewood North
Knoxville
Larimer
Lincoln–Lemington–Belmar
Lincoln Place
Lower Lawrenceville
Manchester
Marshall-Shadeland
Brunot Island
Morningside
Mount Oliver
Mount Washington
Chatham Village
New Homestead
North Point Breeze
North Shore
Northview Heights
Oakwood
Overbrook
Perry North
Perry South
Point Breeze
Polish Hill
Regent Square
Ridgemont
Saint Clair
Shadyside
Sheraden
Panther Hollow
Southshore
Station Square
South Side Flats
SouthSide Works
South Side

In [156]:
print(venues_df.shape)
venues_df.head(20)

(1407, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Allegheny Center,40.453056,-80.005,Children's Museum of Pittsburgh,40.452793,-80.006569,Museum
1,Allegheny Center,40.453056,-80.005,Federal Galley,40.451605,-80.006045,Comfort Food Restaurant
2,Allegheny Center,40.453056,-80.005,Park House,40.453284,-80.001504,Bar
3,Allegheny Center,40.453056,-80.005,El Burro,40.45586,-80.006689,Mexican Restaurant
4,Allegheny Center,40.453056,-80.005,Bistro To Go,40.45345,-80.000995,Deli / Bodega
5,Allegheny Center,40.453056,-80.005,Arnold's Tea,40.453601,-80.000419,Tea Room
6,Allegheny Center,40.453056,-80.005,National Aviary,40.453154,-80.010049,Zoo
7,Allegheny Center,40.453056,-80.005,New Hazlett Theater,40.453081,-80.004844,Performing Arts Venue
8,Allegheny Center,40.453056,-80.005,Max's Allegheny Tavern,40.455322,-79.999919,German Restaurant
9,Allegheny Center,40.453056,-80.005,Priory Fine Pastries,40.453723,-79.999691,Bakery


### Create dummy variables, then restrict to top 10 venue types

In [157]:
# get dummy variables
onehot = pd.get_dummies(venues_df[["Venue Category"]], prefix="", prefix_sep="")

# reinsert neighborhood names
onehot.insert(loc=0, column="Neighborhood", value = venues_df.Neighborhood)

# group by the neighborhoods
onehot = onehot.groupby("Neighborhood").mean().reset_index()

onehot.head()

Unnamed: 0,Neighborhood,American Restaurant,Antique Shop,Arcade,Argentinian Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,...,Vegetarian / Vegan Restaurant,Video Store,Vietnamese Restaurant,Water Park,Wine Bar,Wings Joint,Women's Store,Yoga Studio,Zoo,Zoo Exhibit
0,Allegheny Center,0.028571,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.028571,0.0
1,Allegheny West,0.090909,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.045455,0.0
2,Allentown,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.142857,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Arlington,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Banksville,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
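
To see what the averaged dummy variables mean, here's the same one-hot-then-mean step on a tiny made-up table. The neighborhoods "A" and "B" and their venues are purely illustrative. Passing `dtype=float` is our addition so the dummies are numeric regardless of the pandas version's default.

```python
import pandas as pd

# made-up venues for two hypothetical neighborhoods
toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Bar", "Bar", "Pizza Place", "Park", "Park", "Park"],
})

# one column per category, 1.0 where the venue belongs to it
onehot_toy = pd.get_dummies(toy[["Venue Category"]], prefix="", prefix_sep="", dtype=float)
onehot_toy.insert(loc=0, column="Neighborhood", value=toy["Neighborhood"])

# the mean of a 0/1 column is the share of venues in that category
freq = onehot_toy.groupby("Neighborhood").mean()
print(freq)
```

Each cell is the fraction of a neighborhood's venues that fall in that category, so a neighborhood with two bars out of four venues gets 0.5 under "Bar".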


In [158]:
# method to get only the top num_top_venues venue types for each neighborhood
def get_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [159]:
num_top_venues = 10

indicators = ["st", "nd", "rd"] # for printing 1st, 2nd, 3rd

# create columns according to number of top venues
columns = ["Neighborhood"]
for ind in np.arange(num_top_venues):
    try:
        columns.append("{}{} Most Common Venue".format(ind+1, indicators[ind]))
    except IndexError:
        columns.append("{}th Most Common Venue".format(ind+1))

# create new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted["Neighborhood"] = onehot["Neighborhood"]

for ind in np.arange(onehot.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = get_most_common_venues(onehot.iloc[ind, :], num_top_venues)
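
The column names above rely on an index lookup that falls through to "th" after the third entry. If we ever raised `num_top_venues` past 20, that would produce labels like "21th". Here's a small sketch of a suffix helper that handles those cases. The name `ordinal` is our own.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    # numbers ending in 11, 12 or 13 take 'th' even though they end in 1, 2 or 3
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"

num_top_venues = 10
columns = ["Neighborhood"] + [
    f"{ordinal(i)} Most Common Venue" for i in range(1, num_top_venues + 1)
]
print(columns)
```

For ten venues this gives the same column names as the cell above.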

### The prepped data

In [161]:
neighborhoods_venues_sorted

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Allegheny Center,Deli / Bodega,Burger Joint,Exhibit,American Restaurant,Sandwich Place,Border Crossing,Brewery,Park,Liquor Store,Café
1,Allegheny West,American Restaurant,Sandwich Place,BBQ Joint,Deli / Bodega,Fast Food Restaurant,Food Truck,Café,Exhibit,Thai Restaurant,Restaurant
2,Allentown,Diner,Italian Restaurant,Beer Store,Discount Store,Coffee Shop,Vegetarian / Vegan Restaurant,Fish & Chips Shop,Farmers Market,Fast Food Restaurant,Film Studio
3,Arlington,American Restaurant,Cosmetics Shop,Pharmacy,Theater,Event Space,Food & Drink Shop,Food,Fondue Restaurant,Fish Market,Fish & Chips Shop
4,Banksville,Pizza Place,Park,Pool,Zoo Exhibit,Ethiopian Restaurant,Food,Fondue Restaurant,Fish Market,Fish & Chips Shop,Film Studio
5,Bedford Dwellings,Food,Gym / Fitness Center,Grocery Store,Coffee Shop,Seafood Restaurant,Electronics Store,Fondue Restaurant,Fish Market,Fish & Chips Shop,Film Studio
6,Beechview,Light Rail Station,Park,Platform,Playground,Supermarket,Taco Place,Ethiopian Restaurant,Fondue Restaurant,Fish Market,Fish & Chips Shop
7,Beltzhoover,Moving Target,Tennis Court,Park,Zoo Exhibit,Ethiopian Restaurant,Food,Fondue Restaurant,Fish Market,Fish & Chips Shop,Film Studio
8,Bloomfield,Bar,Italian Restaurant,Thai Restaurant,Grocery Store,Coffee Shop,Pizza Place,Bookstore,Bakery,Sandwich Place,New American Restaurant
9,Bluff,Pizza Place,Bar,Lounge,Bank,Hotel,Tennis Court,Hockey Arena,Bus Station,Sporting Goods Shop,Outdoor Sculpture


# The Model

## Use k-means to cluster the neighborhoods

In [162]:
k = 10

# create new data frame with dummies for the top venues
cluster_df = pd.get_dummies(neighborhoods_venues_sorted.drop(columns="Neighborhood"))

# build k-means model
means = KMeans(n_clusters=k, random_state=0).fit(cluster_df)

# check cluster labels generated
means.labels_

array([0, 7, 6, 1, 9, 3, 1, 9, 4, 0, 1, 8, 9, 9, 8, 5, 4, 5, 3, 5, 6, 9,
       4, 9, 0, 7, 4, 2, 2, 5, 1, 5, 0, 9, 4, 0, 0, 2, 1, 4, 0, 9, 5, 9,
       1, 2, 0, 9, 1, 5, 9, 9, 4, 7, 5, 1, 1, 2, 5, 0, 1, 0, 1, 0, 1, 3,
       9, 4, 3, 0, 5, 1, 1, 4, 6, 4, 3, 8, 9, 9, 1, 5, 8, 1, 1])
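
To get a feel for what the model sees, here's a toy version of the same step with four made-up neighborhoods and only two ranked columns. Setting `n_init=10` is our explicit choice here, not something the notebook's model sets.

```python
import pandas as pd
from sklearn.cluster import KMeans

# four hypothetical neighborhoods: A and B look alike, C and D look alike
ranked = pd.DataFrame({
    "1st Most Common Venue": ["Bar", "Bar", "Park", "Park"],
    "2nd Most Common Venue": ["Pizza Place", "Pizza Place", "Playground", "Playground"],
}, index=["A", "B", "C", "D"])

# each (rank, category) pair becomes its own 0/1 column
features = pd.get_dummies(ranked, dtype=float)
print(list(features.columns))

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features)
labels = model.labels_
print(dict(zip(ranked.index, labels)))
```

Note that `get_dummies` prefixes each dummy with its source column, so "Bar" as the 1st most common venue and "Bar" as the 2nd most common venue are separate features. Two neighborhoods only count as similar on a venue type when it appears at the same rank in both.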

### Create new data frame with merged data

In [163]:
# start with original data frame
merged_df = neighborhoods_venues_sorted.copy()

# add cluster label
merged_df.insert(1, "Cluster Value", means.labels_)

# add latitude and longitude
merged_df = pd.merge(merged_df, df, on="Neighborhood", how="left")

# discovered bug that changes ints to floats when using merge method
# see https://github.com/pandas-dev/pandas/issues/8596
# let's turn our cluster values back into ints
merged_df["Cluster Value"] = merged_df["Cluster Value"].astype('int')

In [164]:
merged_df

Unnamed: 0,Neighborhood,Cluster Value,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,latitude,longitude
0,Allegheny Center,0,Deli / Bodega,Burger Joint,Exhibit,American Restaurant,Sandwich Place,Border Crossing,Brewery,Park,Liquor Store,Café,40.4531,-80.005
1,Allegheny West,7,American Restaurant,Sandwich Place,BBQ Joint,Deli / Bodega,Fast Food Restaurant,Food Truck,Café,Exhibit,Thai Restaurant,Restaurant,40.4521,-80.0158
2,Allentown,6,Diner,Italian Restaurant,Beer Store,Discount Store,Coffee Shop,Vegetarian / Vegan Restaurant,Fish & Chips Shop,Farmers Market,Fast Food Restaurant,Film Studio,40.4211,-79.9939
3,Arlington,1,American Restaurant,Cosmetics Shop,Pharmacy,Theater,Event Space,Food & Drink Shop,Food,Fondue Restaurant,Fish Market,Fish & Chips Shop,40.415,-79.97
4,Banksville,9,Pizza Place,Park,Pool,Zoo Exhibit,Ethiopian Restaurant,Food,Fondue Restaurant,Fish Market,Fish & Chips Shop,Film Studio,40.4119,-80.0389
5,Bedford Dwellings,3,Food,Gym / Fitness Center,Grocery Store,Coffee Shop,Seafood Restaurant,Electronics Store,Fondue Restaurant,Fish Market,Fish & Chips Shop,Film Studio,40.4453,-79.9797
6,Beechview,1,Light Rail Station,Park,Platform,Playground,Supermarket,Taco Place,Ethiopian Restaurant,Fondue Restaurant,Fish Market,Fish & Chips Shop,40.4136,-80.0225
7,Beltzhoover,9,Moving Target,Tennis Court,Park,Zoo Exhibit,Ethiopian Restaurant,Food,Fondue Restaurant,Fish Market,Fish & Chips Shop,Film Studio,40.4161,-80.0031
8,Bloomfield,4,Bar,Italian Restaurant,Thai Restaurant,Grocery Store,Coffee Shop,Pizza Place,Bookstore,Bakery,Sandwich Place,New American Restaurant,40.4611,-79.9481
9,Bluff,0,Pizza Place,Bar,Lounge,Bank,Hotel,Tennis Court,Hockey Arena,Bus Station,Sporting Goods Shop,Outdoor Sculpture,40.4361,-79.9889


# Our cluster

So far, we've created our full cluster data frame. This could be useful in many other circumstances, but our problem only concerns the cluster of neighborhoods most like the South Side Flats. Let's filter our data and map our neighborhoods!

In [165]:
# South Side Flats Details
merged_df.loc[merged_df.Neighborhood == "South Side Flats"]

Unnamed: 0,Neighborhood,Cluster Value,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,latitude,longitude
67,South Side Flats,4,Bar,Greek Restaurant,Yoga Studio,Rock Club,Dive Bar,Coffee Shop,Sports Bar,Pizza Place,Café,Smoke Shop,40.4288,-79.9856


In [166]:
# create variable with our cluster
south_side_flats_cluster = merged_df.loc[merged_df.Neighborhood == "South Side Flats", "Cluster Value"].iloc[0]

south_side_flats_cluster

4

In [167]:
# create a new data frame with our candidate neighborhoods
likely_neighborhoods = merged_df.loc[merged_df["Cluster Value"] == south_side_flats_cluster].reset_index()

likely_neighborhoods = pd.DataFrame(likely_neighborhoods.loc[likely_neighborhoods.Neighborhood != "South Side Flats"])

likely_neighborhoods

Unnamed: 0,index,Neighborhood,Cluster Value,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,latitude,longitude
0,8,Bloomfield,4,Bar,Italian Restaurant,Thai Restaurant,Grocery Store,Coffee Shop,Pizza Place,Bookstore,Bakery,Sandwich Place,New American Restaurant,40.4611,-79.9481
1,16,Central Lawrenceville,4,Bar,Sandwich Place,Taco Place,American Restaurant,Café,Sports Bar,Juice Bar,Karaoke Bar,Bowling Alley,Seafood Restaurant,40.4719,-79.9589
2,22,Chinatown,4,Italian Restaurant,Coffee Shop,Pizza Place,Hotel,Restaurant,American Restaurant,Sandwich Place,Plaza,Bar,Theater,40.4417,-80.0
3,26,East Allegheny,4,Park,Deli / Bodega,Pizza Place,Theater,Pharmacy,Concert Hall,Coffee Shop,Event Space,Chinese Restaurant,Sandwich Place,40.4561,-80.0
4,34,Four Mile Run,4,Bar,Water Park,Disc Golf,Theater,Dive Bar,Fast Food Restaurant,Event Space,Exhibit,Eye Doctor,Farmers Market,40.4275,-79.9472
5,39,Hazelwood,4,Bar,Convenience Store,Bakery,Eastern European Restaurant,Pharmacy,Zoo Exhibit,Fast Food Restaurant,Eye Doctor,Farmers Market,Fish & Chips Shop,40.4089,-79.9411
6,52,Mount Washington,4,Bar,Diner,Pizza Place,Bakery,Pharmacy,Liquor Store,Sandwich Place,Sports Bar,Russian Restaurant,Gastropub,40.4281,-80.0111
8,73,Squirrel Hill North,4,Hotel,Coffee Shop,Clothing Store,Cosmetics Shop,Gift Shop,Peruvian Restaurant,Diner,Electronics Store,Event Space,Frozen Yogurt Shop,40.4497,-79.9281
9,75,Station Square,4,Bar,Scenic Lookout,Coffee Shop,Boat or Ferry,Park,Seafood Restaurant,Shopping Mall,Steakhouse,Sports Bar,Nightclub,40.435,-80.0075


# Final list of neighborhoods!

In [168]:
pd.DataFrame(likely_neighborhoods.Neighborhood)

Unnamed: 0,Neighborhood
0,Bloomfield
1,Central Lawrenceville
2,Chinatown
3,East Allegheny
4,Four Mile Run
5,Hazelwood
6,Mount Washington
8,Squirrel Hill North
9,Station Square


## Let's map them!

In [169]:
import matplotlib.cm as cm
import matplotlib.colors as colors
import folium

In [171]:
pitt_lat = 40.439722
pitt_long = -79.976389

# create map
map_clusters = folium.Map(location=[pitt_lat, pitt_long], zoom_start=13)

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(likely_neighborhoods["latitude"], likely_neighborhoods["longitude"], likely_neighborhoods["Neighborhood"], likely_neighborhoods["Cluster Value"]):
    label = folium.Popup(str(poi), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius = 5,
        popup=label,
        color="black",
        fill=True,
        fill_color="yellow",
        fill_opacity=0.7).add_to(map_clusters)
    
map_clusters

### Further Considerations
Initially, we captured all the venues for each neighborhood. However, capturing only restaurants may give better insight into the restaurant market. One thing we may want to explore is how a model with restaurant-only venues compares to a model created with all venues. If there's a strong similarity between the two, we can have more confidence that non-restaurant venues correlate with the kinds of restaurants in the area as well.
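
One possible way to put a number on that similarity is the Rand index: the share of neighborhood pairs that the two models treat the same way (together in both, or apart in both). The sketch below is only an illustration. The function name and the two label lists are made up, and we haven't built the restaurant-only model.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of item pairs on which two clusterings agree (1.0 = identical grouping)."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same items")
    if len(labels_a) < 2:
        raise ValueError("need at least two items to compare")
    pairs = list(combinations(range(len(labels_a)), 2))
    # a pair agrees when both models put it together, or both keep it apart
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)

# hypothetical cluster labels for five neighborhoods
all_venue_labels = [0, 0, 1, 1, 2]
restaurant_labels = [1, 1, 0, 0, 2]   # same grouping, different label numbers

print(rand_index(all_venue_labels, restaurant_labels))
```

Comparing pairs rather than raw label values matters because k-means numbers its clusters arbitrarily. scikit-learn also offers `sklearn.metrics.adjusted_rand_score`, which corrects this kind of score for chance agreement.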

We may also want to look at more factors than just venues. Economic factors such as income and home value may be important, as well as population factors such as population density and age distribution. We'll have to explore further data sources if we'd like to capture those factors.