### Capstone project

**Introduction/Business Problem**

A few friends recently opened a Russian restaurant in Baltimore, about 60 km from Washington, DC. The restaurant has done very well, and they are now thinking about expanding into the Washington area. They are confident that a second restaurant can do equally well there, if not better. However, branching out would be a very big leap for the company and is critical to its success: if the new restaurant fails, it could hurt the original restaurant, causing severe losses or even forcing it to close.

Given the magnitude of this decision, the group wants to be as confident as possible before opening the new restaurant. One of the most important tasks in opening a restaurant is choosing its location. The group would like a quick, efficient, and data-driven way to find a good location. Traditional approaches (contacting a real estate agent, checking local listings, etc.) may take too long or may not be as accurate. So the group decided to reach out to a data scientist to help them find the best location for the new restaurant.

Our approach is to group areas of Washington into "clusters" based on similar traits. Features such as the types of venues and the most popular venues in each area will help the group identify similar neighborhoods, and in turn decide which areas are most suitable for the new restaurant. From there, they can decide on specific locations.

**Description of the data that will be used to solve the problem**

As stated in the introduction, we will break Washington up into different areas within the city. We can then analyze the types of venues in each area to determine how similar the areas are to each other. How similar they are determines which cluster (or "group") they are put into. Once we know which cluster each area belongs to, we can decide which areas are the best candidates for the new restaurant based on the group's criteria.

Our first step is to find location data for the city of Washington, including zip codes, latitudes, and longitudes. We found a free, downloadable online source containing location data for all zip codes in the US; the only stipulation is that we must credit the source. You can download the same data from https://simplemaps.com/data/us-zips . While the file contains many attributes for each zip code, we will focus only on zip code, latitude, longitude, city, county, and state. Using this information, we will then use the Foursquare API to retrieve the venues, and their categories, within these zip codes to help determine the clusters. We will also use K-means clustering to group the zip codes, comparing candidate numbers of clusters and checking how well each zip code fits the cluster it is assigned to.

The analysis of this data will be similar to the analysis done in the Segmenting and Clustering Toronto neighborhoods lab. In that lab we gathered the location data for a city, used the Foursquare API to determine the number and types of venues within each neighborhood, and used that information to determine which cluster each neighborhood belongs in. We will complete this analysis the same way.
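As a rough illustration of how a number of clusters might be compared, the sketch below fits K-means for several candidate values of k and scores each with the mean silhouette coefficient. The feature matrix is synthetic (made-up category shares for 30 imaginary areas), and the silhouette score is only one common heuristic; it is an assumption here, not necessarily the criterion used later in this analysis.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for per-area venue-category frequencies (made-up numbers).
# Rows are areas, columns are category shares; three clear groups of areas.
rng = np.random.default_rng(0)
centers = np.array([[0.8, 0.1, 0.1],
                    [0.1, 0.8, 0.1],
                    [0.1, 0.1, 0.8]])
features = np.vstack([c + rng.normal(scale=0.02, size=(10, 3)) for c in centers])

# Fit K-means for each candidate k and record the mean silhouette score.
scores = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(features)
    scores[k] = silhouette_score(features, model.labels_)

# The k with the highest score is the best-separated clustering by this measure.
best_k = max(scores, key=scores.get)
print(scores)
print("best k by silhouette:", best_k)
```

On real venue data the scores are usually much closer together, so the result is better treated as a hint than as a definitive answer.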

**Methodology section**

As stated in the Introduction and Data sections above, our objective is to cluster all of the data points so that we can determine the best cluster. Once we determine the best cluster, we can pass the data along to the restaurant owners so that they can decide which specific area is the best place to open the second restaurant.

First we must import the necessary libraries in order to properly start analyzing our data. Then, we must read our csv file with all the location data into a pandas dataframe so that we may begin analyzing the data.

In [2]:
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
import numpy as np

In [9]:
zcode=pd.read_csv('uszips.csv')
print(zcode.shape)
zcode.head()

(33099, 16)


Unnamed: 0,zip,lat,lng,city,state_id,state_name,zcta,parent_zcta,population,density,county_fips,county_name,all_county_weights,imprecise,military,timezone
0,601,18.18,-66.7522,Adjuntas,PR,Puerto Rico,True,,18570,111.4,72001,Adjuntas,"{'72001':99.43,'72141':0.57}",False,False,America/Puerto_Rico
1,602,18.3607,-67.1752,Aguada,PR,Puerto Rico,True,,41520,523.7,72003,Aguada,{'72003':100},False,False,America/Puerto_Rico
2,603,18.4544,-67.122,Aguadilla,PR,Puerto Rico,True,,54689,667.9,72005,Aguadilla,{'72005':100},False,False,America/Puerto_Rico
3,606,18.1672,-66.9383,Maricao,PR,Puerto Rico,True,,6615,60.4,72093,Maricao,"{'72093':94.88,'72121':1.35,'72153':3.78}",False,False,America/Puerto_Rico
4,610,18.2903,-67.1224,Anasco,PR,Puerto Rico,True,,29016,311.9,72011,Añasco,"{'72003':0.55,'72011':99.45}",False,False,America/Puerto_Rico


We check to see what the different unique states are in this dataframe, and verify that DC is one of those states. Then we filter the dataframe to only show location data of DC cities. We will later focus on Washington zip codes only.

In [10]:
zcode["state_id"].unique()

array(['PR', 'MA', 'RI', 'NH', 'ME', 'VT', 'CT', 'NY', 'NJ', 'PA', 'DE',
       'DC', 'VA', 'MD', 'WV', 'NC', 'SC', 'GA', 'FL', 'AL', 'TN', 'MS',
       'KY', 'OH', 'IN', 'MI', 'IA', 'WI', 'MN', 'SD', 'ND', 'MT', 'IL',
       'MO', 'KS', 'NE', 'LA', 'AR', 'OK', 'TX', 'CO', 'WY', 'ID', 'UT',
       'AZ', 'NM', 'NV', 'CA', 'HI', 'OR', 'WA', 'AK'], dtype=object)

In [11]:
zcode = zcode.drop(zcode.index[zcode['state_id'] != 'DC'])
zcode.head()

Unnamed: 0,zip,lat,lng,city,state_id,state_name,zcta,parent_zcta,population,density,county_fips,county_name,all_county_weights,imprecise,military,timezone
6213,20001,38.9108,-77.0178,Washington,DC,District of Columbia,True,,38551,6816.2,11001,District of Columbia,{'11001':100},False,False,America/New_York
6214,20002,38.9051,-76.9843,Washington,DC,District of Columbia,True,,52370,3849.1,11001,District of Columbia,{'11001':100},False,False,America/New_York
6215,20003,38.8812,-76.9906,Washington,DC,District of Columbia,True,,26454,4553.3,11001,District of Columbia,{'11001':100},False,False,America/New_York
6216,20004,38.8949,-77.0287,Washington,DC,District of Columbia,True,,1622,1800.2,11001,District of Columbia,{'11001':100},False,False,America/New_York
6217,20005,38.9047,-77.0315,Washington,DC,District of Columbia,True,,12775,11575.1,11001,District of Columbia,{'11001':100},False,False,America/New_York


In [12]:
# We also want to make sure that there are no military zones left in the dataframe
zcode[zcode['military']!=False].size

0

Here we drop the columns that we don't need for further analysis. We really only need the zip code, latitude, longitude, city name, state, and county.

In [13]:
zcode = zcode.drop(['zcta','parent_zcta','population','density','county_fips','all_county_weights','imprecise','military','timezone'], axis=1)
zcode.head()

Unnamed: 0,zip,lat,lng,city,state_id,state_name,county_name
6213,20001,38.9108,-77.0178,Washington,DC,District of Columbia,District of Columbia
6214,20002,38.9051,-76.9843,Washington,DC,District of Columbia,District of Columbia
6215,20003,38.8812,-76.9906,Washington,DC,District of Columbia,District of Columbia
6216,20004,38.8949,-77.0287,Washington,DC,District of Columbia,District of Columbia
6217,20005,38.9047,-77.0315,Washington,DC,District of Columbia,District of Columbia


In [14]:
zcode.shape

(52, 7)

In [15]:
# We have 52 zip codes located in DC, now we'll focus on zip codes located in the Washington area
zcode = zcode[zcode['city']=='Washington']
zcode

Unnamed: 0,zip,lat,lng,city,state_id,state_name,county_name
6213,20001,38.9108,-77.0178,Washington,DC,District of Columbia,District of Columbia
6214,20002,38.9051,-76.9843,Washington,DC,District of Columbia,District of Columbia
6215,20003,38.8812,-76.9906,Washington,DC,District of Columbia,District of Columbia
6216,20004,38.8949,-77.0287,Washington,DC,District of Columbia,District of Columbia
6217,20005,38.9047,-77.0315,Washington,DC,District of Columbia,District of Columbia
6218,20006,38.8986,-77.0413,Washington,DC,District of Columbia,District of Columbia
6219,20007,38.9141,-77.0787,Washington,DC,District of Columbia,District of Columbia
6220,20008,38.9359,-77.0593,Washington,DC,District of Columbia,District of Columbia
6221,20009,38.9199,-77.0375,Washington,DC,District of Columbia,District of Columbia
6222,20010,38.9324,-77.03,Washington,DC,District of Columbia,District of Columbia


In [16]:
zcode.shape

(51, 7)

In [17]:
zcode["city"].unique()

array(['Washington'], dtype=object)
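The filtering steps above (state, city, military flag, column selection) could also be written as a single boolean mask. The sketch below shows that idiom on a small hand-made sample; the Baltimore and military rows and the shortened `keep` list are invented for illustration and are not taken from the real file.

```python
import pandas as pd

# Small made-up sample with the same column names as the zip-code file.
sample = pd.DataFrame({
    "zip": [20001, 20002, 21201, 20301],
    "lat": [38.9108, 38.9051, 39.2946, 38.8900],
    "lng": [-77.0178, -76.9843, -76.6252, -77.0300],
    "city": ["Washington", "Washington", "Baltimore", "Washington"],
    "state_id": ["DC", "DC", "MD", "DC"],
    "military": [False, False, False, True],
    "population": [38551, 52370, 16000, 0],
})

# Columns to keep (a shortened, illustrative list).
keep = ["zip", "lat", "lng", "city", "state_id"]

# Combine all row conditions into one mask, then select rows and columns at once.
mask = (
    (sample["state_id"] == "DC")
    & (sample["city"] == "Washington")
    & ~sample["military"]
)
dc_zips = sample.loc[mask, keep].reset_index(drop=True)
print(dc_zips)
```

Selecting the columns to keep, rather than listing the ones to drop, also makes the code less sensitive to extra columns appearing in a newer version of the file.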

**We're going to create a map of Washington to visualize these zip codes**

In [18]:
# As we need to import libraries for creating our map, 
# let's also import the other libraries we will use later

import json # library to handle JSON files

import requests # library to handle requests

from pandas import json_normalize # transform JSON file into a pandas dataframe (pandas >= 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means for clustering stage
from sklearn.cluster import KMeans

# uncomment the next line if you haven't completed the Foursquare API lab
# !conda install -c conda-forge folium=0.5.0 --yes
import folium # map rendering library

print('Done')

Done


In [19]:
# We're going to create a map of Washington, centered on the middle of the city
latitude = 38.88
longitude = -77.0
map_wash = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lati, long, postcode, city in zip(zcode['lat'], zcode['lng'], zcode['zip'], zcode['city']):
    label = '{}, {}'.format(postcode, city)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lati, long],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_wash)  
    
map_wash
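The map center above is hard-coded. One alternative, shown here only as a sketch, is to derive it from the data by averaging the zip code coordinates. The example uses the first five DC rows from the table above; on the full dataframe the same two `.mean()` calls would apply to `zcode['lat']` and `zcode['lng']`.

```python
import pandas as pd

# First five DC zip-code coordinates, copied from the table above.
coords = pd.DataFrame({
    "zip": [20001, 20002, 20003, 20004, 20005],
    "lat": [38.9108, 38.9051, 38.8812, 38.8949, 38.9047],
    "lng": [-77.0178, -76.9843, -76.9906, -77.0287, -77.0315],
})

# A simple centroid: the arithmetic mean of the latitudes and longitudes.
center_lat = coords["lat"].mean()
center_lng = coords["lng"].mean()
print(round(center_lat, 4), round(center_lng, 4))
```

An unweighted mean is adequate for an area as small as a city; it ignores population and the shape of each zip code.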

**Work on our second data source**

In [20]:
# Our Foursquare credentials

CLIENT_ID = 'xxx' # your Foursquare ID
CLIENT_SECRET = 'xxx' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

First, we're going to use the Foursquare API to return venue data for the location data we already have. We request the top 100 venues within a 500 m radius of our central point (latitude = 38.88, longitude = -77.0). We create the URL with our credentials and the parameters listed below. Once we send the request to Foursquare, we get back a JSON response containing the list of venues.
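As a side note on how such a URL is put together, the sketch below assembles the same query string from a dictionary with the standard library's `urlencode`, which takes care of escaping. The helper name `build_explore_url` is hypothetical, the credentials are placeholders, and no request is actually sent.

```python
from urllib.parse import urlencode

# Endpoint used in this notebook.
BASE_URL = "https://api.foursquare.com/v2/venues/explore"

def build_explore_url(client_id, client_secret, version, lat, lng, radius=500, limit=100):
    """Return the explore URL for one point; nothing is sent over the network."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),  # "latitude,longitude"
        "radius": radius,                # metres
        "limit": limit,                  # maximum number of venues returned
    }
    return BASE_URL + "?" + urlencode(params)

url = build_explore_url("xxx", "xxx", "20180605", 38.88, -77.0)
print(url)
```

Note that `urlencode` percent-encodes the comma in the `ll` value, which is a standard encoding of the same character.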

In [21]:
LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 500 # define radius

# latitude and longitude (global variables) were defined above, just before we created the map of Washington,
# and are the coordinates of our central point

# create URL
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    latitude, 
    longitude, 
    radius, 
    LIMIT)
url # display URL

results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5cfebf654c1f6753b07385ed'},
 'response': {'suggestedFilters': {'header': 'Tap to show:',
   'filters': [{'name': '$-$$$$', 'key': 'price'},
    {'name': 'Open now', 'key': 'openNow'}]},
  'headerLocation': 'Capitol Hill',
  'headerFullLocation': 'Capitol Hill, Washington',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 50,
  'suggestedBounds': {'ne': {'lat': 38.884500004500005,
    'lng': -76.99423016057727},
   'sw': {'lat': 38.8754999955, 'lng': -77.00576983942273}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '4ae04a3cf964a5207d7e21e3',
       'name': 'Cornercopia',
       'location': {'address': '1000 3rd St SE',
        'crossStreet': 'at K St.',
        'lat': 38.87833887482248,
        'lng': -77.0020996885

As you can see, the request returns a long list of venues around our central point in Washington. Now we're going to extract each venue's category from this list, because that's the main feature we're going to use to compare how similar the different zip codes are. Once we determine their similarities we can begin clustering those areas.

In [22]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

# now we can clean the json structure and build our dataframe

venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Cornercopia,Deli / Bodega,38.878339,-77.0021
1,Biker Barre,Gym / Fitness Center,38.880068,-76.995964
2,Canal Park,Park,38.877904,-77.00326
3,Capitol Hill Arts Workshop,Art Gallery,38.881452,-76.99635
4,Rose's Luxury,New American Restaurant,38.88066,-76.995175


In [23]:
nearby_venues.shape

(50, 4)

In [24]:
nearby_venues['categories'].unique()

array(['Deli / Bodega', 'Gym / Fitness Center', 'Park', 'Art Gallery',
       'New American Restaurant', 'Grocery Store', 'Donut Shop',
       'Restaurant', 'Wine Shop', 'Food Truck', 'Pizza Place',
       'Greek Restaurant', 'Café', 'Playground', 'Supermarket', 'Plaza',
       'Eastern European Restaurant', 'Coffee Shop', 'Beer Garden',
       'Liquor Store', 'Italian Restaurant', 'Burger Joint',
       'Sushi Restaurant', 'Bakery', 'Steakhouse', 'Pet Store',
       'American Restaurant', 'Bar', 'Hotel', 'Bank', 'Dog Run',
       'Furniture / Home Store', 'Martial Arts Dojo',
       'Bike Rental / Bike Share', 'Sandwich Place', 'Pharmacy',
       "Women's Store", 'Mobile Phone Shop', 'Farmers Market',
       'Cycle Studio'], dtype=object)
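To see the flattening step in isolation, the sketch below runs `pd.json_normalize` (available in pandas 1.0 and later) on two hand-made items shaped like the `items` list in the response above. The second venue and its empty category list are invented, to show how a venue without a category would be handled.

```python
import pandas as pd

# Hand-made items shaped like the 'items' list in the Foursquare response.
# The second venue (and its empty category list) is invented for illustration.
items = [
    {"venue": {"name": "Cornercopia",
               "location": {"lat": 38.878339, "lng": -77.0021},
               "categories": [{"name": "Deli / Bodega"}]}},
    {"venue": {"name": "Hypothetical Venue",
               "location": {"lat": 38.8800, "lng": -77.0000},
               "categories": []}},
]

# Nested dictionaries become dotted column names such as 'venue.location.lat'.
flat = pd.json_normalize(items)

def first_category(categories):
    """Return the name of the first category, or None when the list is empty."""
    return categories[0]["name"] if categories else None

flat["venue.categories"] = flat["venue.categories"].apply(first_category)
flat = flat[["venue.name", "venue.categories", "venue.location.lat", "venue.location.lng"]]

# Keep only the last part of each dotted name: name, categories, lat, lng.
flat.columns = [col.split(".")[-1] for col in flat.columns]
print(flat)
```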

The function "getNearbyVenues" repeats this request for every zip code and collects the results into one dataframe. It also takes a parameter "min_venues" so that we can spot requests that return very few venues. If the number of venues is too small (e.g. 0), we can look at that zip code more closely to understand why there are so few venues.

In [25]:
# Create a function to repeat a similar process for all the areas (postal codes)

def getNearbyVenues(postalcodes, latitudes, longitudes, radius=500, min_venues=0):
    # min_venues is only used by the optional low-count warning below

    venues_list = []
    for zipcode, lat, lng in zip(postalcodes, latitudes, longitudes):

        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID,
            CLIENT_SECRET,
            VERSION,
            lat,
            lng,
            radius,
            LIMIT)

        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']

        # if len(results) < min_venues: print("number of venues for zipcode {} is low : {} venues.".format(zipcode, len(results)))
        print("number of venues for zipcode {} is : {} venues.".format(zipcode,len(results)))

        # keep only the relevant information for each nearby venue
        for v in results:
            venues_list.append((
                zipcode,
                lat,
                lng,
                v['venue']['name'],
                v['venue']['location']['lat'],
                v['venue']['location']['lng'],
                v['venue']['categories'][0]['name']))

    # one row per venue
    nearby_venues = pd.DataFrame(venues_list, columns=[
        'PostalCode',
        'PC Latitude',
        'PC Longitude',
        'Venue',
        'Venue Latitude',
        'Venue Longitude',
        'Venue Category'])

    return nearby_venues

In [26]:
# Now we can run the function above with our database of areas to get the number of venues per zip code

dal_venues = getNearbyVenues(postalcodes=zcode['zip'],
                            latitudes=zcode['lat'],
                            longitudes=zcode['lng'],
                            min_venues=0)

number of venues for zipcode 20001 is : 33 venues.
number of venues for zipcode 20002 is : 12 venues.
number of venues for zipcode 20003 is : 62 venues.
number of venues for zipcode 20004 is : 100 venues.
number of venues for zipcode 20005 is : 100 venues.
number of venues for zipcode 20006 is : 88 venues.
number of venues for zipcode 20007 is : 7 venues.
number of venues for zipcode 20008 is : 37 venues.
number of venues for zipcode 20009 is : 71 venues.
number of venues for zipcode 20010 is : 37 venues.
number of venues for zipcode 20011 is : 13 venues.
number of venues for zipcode 20012 is : 6 venues.
number of venues for zipcode 20015 is : 2 venues.
number of venues for zipcode 20016 is : 18 venues.
number of venues for zipcode 20017 is : 11 venues.
number of venues for zipcode 20018 is : 18 venues.
number of venues for zipcode 20019 is : 7 venues.
number of venues for zipcode 20020 is : 0 venues.
number of venues for zipcode 20024 is : 36 venues.
number of venues for zipcode 20032

In [27]:
print(dal_venues.shape)
dal_venues.head()

(2228, 7)


Unnamed: 0,PostalCode,PC Latitude,PC Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,20001,38.9108,-77.0178,Compass Coffee,38.910569,-77.021703,Coffee Shop
1,20001,38.9108,-77.0178,Dacha Beer Garden,38.91122,-77.022129,Beer Garden
2,20001,38.9108,-77.0178,Grand Cata,38.910816,-77.022,Wine Shop
3,20001,38.9108,-77.0178,Shaw Historical District,38.911573,-77.021865,Neighborhood
4,20001,38.9108,-77.0178,Truxton Inn,38.91388,-77.015838,Cocktail Bar
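One thing to keep in mind is that a zip code with no venues at all (such as 20020 in the log above) contributes no rows to this table, so it silently drops out of any later grouping. The sketch below shows one way to count venues per zip code while keeping the empty ones visible. The miniature table and the threshold of 3 are made up for illustration.

```python
import pandas as pd

# Made-up miniature of the venues table: one row per venue.
venues = pd.DataFrame({
    "PostalCode": [20001, 20001, 20001, 20015, 20015],
    "Venue": ["A", "B", "C", "D", "E"],
})

# Every zip code that was queried, including one that returned nothing.
all_zips = [20001, 20015, 20020]

min_venues = 3  # arbitrary threshold for "too few venues"

# Count rows per zip code, then reindex so that missing zip codes get a count of 0.
counts = (
    venues.groupby("PostalCode").size()
    .reindex(all_zips, fill_value=0)
    .rename("n_venues")
)
sparse = counts[counts < min_venues]
print(sparse)
```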


We show the number of unique venue categories and the names of those categories.

In [28]:
dal_venues['PostalCode'].unique().shape
print('There are {} unique categories.'.format(len(dal_venues['Venue Category'].unique())))

There are 242 unique categories.


In [29]:
dal_venues['Venue Category'].unique()

array(['Coffee Shop', 'Beer Garden', 'Wine Shop', 'Neighborhood',
       'Cocktail Bar', 'Spanish Restaurant', 'Thai Restaurant',
       'BBQ Joint', 'Dive Bar', 'Hot Dog Joint', 'Korean Restaurant',
       'Café', 'Wine Bar', 'Ice Cream Shop', 'Bar',
       'Middle Eastern Restaurant', 'Grocery Store', 'Pet Store',
       'Dog Run', 'Market', 'Liquor Store', 'Gym / Fitness Center',
       'Donut Shop', 'Convenience Store', 'Pizza Place', 'Yoga Studio',
       'Bus Stop', 'Park', 'Sandwich Place', 'Indie Theater', 'Bus Line',
       'Diner', 'Gym', 'Moving Target', 'New American Restaurant',
       'Pharmacy', 'Food & Drink Shop', 'Spa', 'Greek Restaurant',
       'Chinese Restaurant', 'Restaurant', 'American Restaurant', 'Plaza',
       'Eastern European Restaurant', 'Italian Restaurant',
       'Asian Restaurant', 'Gay Bar', 'Belgian Restaurant', 'Supermarket',
       'Art Gallery', 'Bakery', 'Mediterranean Restaurant', 'Steakhouse',
       'Sushi Restaurant', 'Seafood Restaurant', '

**Use one-hot encoding to further break down each zip code**

The process we're about to perform is:

We apply one-hot encoding to the venues dataframe we created, so each venue row gets a 1 in the column of its own category and a 0 in every other category column.

We leave out the venues' own latitude and longitude so that the analysis stays at the zip code level, and add the postal code column back to the one-hot dataframe.

We group the rows by postal code and take the mean of each category column, which gives the frequency of occurrence of each category per zip code. We can then print the 5 most common venue categories in each zip code.

Finally, we put that information for all areas into a dataframe: using a function that sorts the venue categories in descending order of frequency, we create a new dataframe and display the top 10 venue categories for each zip code we consider.
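The steps above can be tried end to end on a tiny example first. The sketch below uses five made-up venues in two zip codes. The helper `top_categories` is a hypothetical name, and `dtype=int` is passed to `get_dummies` so the encoded columns are 0/1 integers regardless of pandas version.

```python
import pandas as pd

# Made-up venues for two zip codes.
toy = pd.DataFrame({
    "PostalCode": [20001, 20001, 20001, 20002, 20002],
    "Venue Category": ["Coffee Shop", "Coffee Shop", "Bar", "Park", "Bar"],
})

# One row per venue, one 0/1 column per category.
onehot = pd.get_dummies(toy["Venue Category"], dtype=int)
onehot.insert(0, "PostalCode", toy["PostalCode"])

# The mean of the 0/1 columns per zip code is the share of venues in each category.
grouped = onehot.groupby("PostalCode").mean()

def top_categories(row, n=2):
    """Return the n category names with the highest share for one zip code."""
    return row.sort_values(ascending=False).head(n).index.tolist()

top = {zip_code: top_categories(row) for zip_code, row in grouped.iterrows()}
print(grouped.round(3))
print(top)
```

In zip code 20001, two of the three venues are coffee shops, so the "Coffee Shop" column averages to 2/3. That is the sense in which the grouped table holds frequencies rather than counts.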

In [30]:
# one hot encoding
dal_venues_onehot = pd.get_dummies(dal_venues[['Venue Category']], prefix="", prefix_sep="")

# add postalcode column back to dataframe
dal_venues_onehot['PostalCode'] = dal_venues['PostalCode'] 

# move postalcode column to the first column
fixed_columns = [dal_venues_onehot.columns[-1]] + list(dal_venues_onehot.columns[:-1])
dal_venues_onehot = dal_venues_onehot[fixed_columns]

dal_venues_onehot.head()

Unnamed: 0,PostalCode,Accessories Store,American Restaurant,Antique Shop,Arcade,Arepa Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,BBQ Joint,Bagel Shop,Bakery,Bank,Bar,Baseball Field,Beer Bar,Beer Garden,Belgian Restaurant,Bike Rental / Bike Share,Boat or Ferry,Bookstore,Botanical Garden,Brazilian Restaurant,Breakfast Spot,Brewery,Building,Burger Joint,Burrito Place,Bus Line,Bus Station,Bus Stop,Business Service,Cafeteria,Café,Camera Store,Capitol Building,Caribbean Restaurant,Chaat Place,Chinese Restaurant,Chocolate Shop,Christmas Market,Clothing Store,Cocktail Bar,Coffee Shop,College Arts Building,College Bookstore,College Cafeteria,College Quad,College Residence Hall,Comedy Club,Comic Shop,Concert Hall,Convenience Store,Cosmetics Shop,Courthouse,Creperie,Cuban Restaurant,Cupcake Shop,Cycle Studio,Deli / Bodega,Dessert Shop,Diner,Discount Store,Dive Bar,Dog Run,Donut Shop,Eastern European Restaurant,Electronics Store,Ethiopian Restaurant,Event Space,Exhibit,Falafel Restaurant,Farmers Market,Fast Food Restaurant,Filipino Restaurant,Fish Market,Flower Shop,Food,Food & Drink Shop,Food Court,Food Truck,Fountain,French Restaurant,Fried Chicken Joint,Frozen Yogurt Shop,Furniture / Home Store,Garden,Gas Station,Gastropub,Gay Bar,General Entertainment,German Restaurant,Gift Shop,Gluten-free Restaurant,Golf Course,Golf Driving Range,Gourmet Shop,Government Building,Greek Restaurant,Grocery Store,Gym,Gym / Fitness Center,Gym Pool,Halal Restaurant,Harbor / Marina,Health & Beauty Service,Heliport,Historic Site,History Museum,Hostel,Hot Dog Joint,Hotel,Hotel Bar,IT Services,Ice Cream Shop,Indian Restaurant,Indie Movie Theater,Indie Theater,Intersection,Irish Pub,Israeli Restaurant,Italian Restaurant,Japanese Restaurant,Jazz Club,Jewelry Store,Juice Bar,Karaoke Bar,Korean Restaurant,Lake,Latin American Restaurant,Laundry Service,Library,Light Rail Station,Liquor Store,Lounge,Market,Martial Arts Dojo,Massage Studio,Mediterranean Restaurant,Memorial Site,Men's Store,Metro Station,Mexican Restaurant,Middle Eastern Restaurant,Mobile Phone Shop,Modern European Restaurant,Monument / Landmark,Movie Theater,Moving Target,Museum,Music Venue,National Park,Neighborhood,New American Restaurant,Nightclub,Nightlife Spot,Noodle House,Office,Opera House,Optical Shop,Other Repair Shop,Outdoor Sculpture,Outdoors & Recreation,Paper / Office Supplies Store,Park,Parking,Performing Arts Venue,Persian Restaurant,Peruvian Restaurant,Pet Store,Pharmacy,Pie Shop,Pizza Place,Planetarium,Playground,Plaza,Poke Place,Pool,Portuguese Restaurant,Pub,Ramen Restaurant,Record Shop,Recreation Center,Rental Car Location,Restaurant,River,Roof Deck,Russian Restaurant,Salad Place,Salon / Barbershop,Sandwich Place,Scenic Lookout,Science Museum,Sculpture Garden,Seafood Restaurant,Shipping Store,Shopping Mall,Skating Rink,Smoke Shop,Smoothie Shop,Snack Place,Soccer Field,Soccer Stadium,Soup Place,South American Restaurant,Southern / Soul Food Restaurant,Spa,Spanish Restaurant,Speakeasy,Sporting Goods Shop,Sports Bar,Steakhouse,Student Center,Supermarket,Supplement Shop,Sushi Restaurant,Taco Place,Tapas Restaurant,Tea Room,Tennis Court,Thai Restaurant,Theater,Theme Park Ride / Attraction,Thrift / Vintage Store,Tiki Bar,Tour Provider,Tourist Information Center,Trail,Train Station,Tree,Vegetarian / Vegan Restaurant,Veterinarian,Video Store,Vietnamese Restaurant,Weight Loss Center,Whisky Bar,Wine Bar,Wine Shop,Women's Store,Xinjiang Restaurant,Yoga Studio
0,20001,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,20001,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,20001,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0
3,20001,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,20001,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [31]:
dal_venues_onehot.shape

(2228, 243)

In [32]:
dal_venues_grouped = dal_venues_onehot.groupby('PostalCode').mean().reset_index()
dal_venues_grouped.head()

Unnamed: 0,PostalCode,Accessories Store,American Restaurant,Antique Shop,Arcade,Arepa Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,BBQ Joint,Bagel Shop,Bakery,Bank,Bar,Baseball Field,Beer Bar,Beer Garden,Belgian Restaurant,Bike Rental / Bike Share,Boat or Ferry,Bookstore,Botanical Garden,Brazilian Restaurant,Breakfast Spot,Brewery,Building,Burger Joint,Burrito Place,Bus Line,Bus Station,Bus Stop,Business Service,Cafeteria,Café,Camera Store,Capitol Building,Caribbean Restaurant,Chaat Place,Chinese Restaurant,Chocolate Shop,Christmas Market,Clothing Store,Cocktail Bar,Coffee Shop,College Arts Building,College Bookstore,College Cafeteria,College Quad,College Residence Hall,Comedy Club,Comic Shop,Concert Hall,Convenience Store,Cosmetics Shop,Courthouse,Creperie,Cuban Restaurant,Cupcake Shop,Cycle Studio,Deli / Bodega,Dessert Shop,Diner,Discount Store,Dive Bar,Dog Run,Donut Shop,Eastern European Restaurant,Electronics Store,Ethiopian Restaurant,Event Space,Exhibit,Falafel Restaurant,Farmers Market,Fast Food Restaurant,Filipino Restaurant,Fish Market,Flower Shop,Food,Food & Drink Shop,Food Court,Food Truck,Fountain,French Restaurant,Fried Chicken Joint,Frozen Yogurt Shop,Furniture / Home Store,Garden,Gas Station,Gastropub,Gay Bar,General Entertainment,German Restaurant,Gift Shop,Gluten-free Restaurant,Golf Course,Golf Driving Range,Gourmet Shop,Government Building,Greek Restaurant,Grocery Store,Gym,Gym / Fitness Center,Gym Pool,Halal Restaurant,Harbor / Marina,Health & Beauty Service,Heliport,Historic Site,History Museum,Hostel,Hot Dog Joint,Hotel,Hotel Bar,IT Services,Ice Cream Shop,Indian Restaurant,Indie Movie Theater,Indie Theater,Intersection,Irish Pub,Israeli Restaurant,Italian Restaurant,Japanese Restaurant,Jazz Club,Jewelry Store,Juice Bar,Karaoke Bar,Korean Restaurant,Lake,Latin American Restaurant,Laundry Service,Library,Light Rail Station,Liquor Store,Lounge,Market,Martial Arts Dojo,Massage Studio,Mediterranean Restaurant,Memorial Site,Men's Store,Metro Station,Mexican Restaurant,Middle Eastern Restaurant,Mobile Phone Shop,Modern European Restaurant,Monument / Landmark,Movie Theater,Moving Target,Museum,Music Venue,National Park,Neighborhood,New American Restaurant,Nightclub,Nightlife Spot,Noodle House,Office,Opera House,Optical Shop,Other Repair Shop,Outdoor Sculpture,Outdoors & Recreation,Paper / Office Supplies Store,Park,Parking,Performing Arts Venue,Persian Restaurant,Peruvian Restaurant,Pet Store,Pharmacy,Pie Shop,Pizza Place,Planetarium,Playground,Plaza,Poke Place,Pool,Portuguese Restaurant,Pub,Ramen Restaurant,Record Shop,Recreation Center,Rental Car Location,Restaurant,River,Roof Deck,Russian Restaurant,Salad Place,Salon / Barbershop,Sandwich Place,Scenic Lookout,Science Museum,Sculpture Garden,Seafood Restaurant,Shipping Store,Shopping Mall,Skating Rink,Smoke Shop,Smoothie Shop,Snack Place,Soccer Field,Soccer Stadium,Soup Place,South American Restaurant,Southern / Soul Food Restaurant,Spa,Spanish Restaurant,Speakeasy,Sporting Goods Shop,Sports Bar,Steakhouse,Student Center,Supermarket,Supplement Shop,Sushi Restaurant,Taco Place,Tapas Restaurant,Tea Room,Tennis Court,Thai Restaurant,Theater,Theme Park Ride / Attraction,Thrift / Vintage Store,Tiki Bar,Tour Provider,Tourist Information Center,Trail,Train Station,Tree,Vegetarian / Vegan Restaurant,Veterinarian,Video Store,Vietnamese Restaurant,Weight Loss Center,Whisky Bar,Wine Bar,Wine Shop,Women's Store,Xinjiang Restaurant,Yoga Studio
0,20001,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.060606,0.0,0.0,0.0,0.030303,0.0,0.0,0.030303,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.030303,0.0,0.0,0.030303,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.030303,0.030303,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.030303,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.030303,0.030303,0.030303,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.030303,0.0,0.030303,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.030303,0.0,0.0,0.0,0.030303,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.030303,0.0,0.0,0.0,0.0,0.0,0.090909,0.0,0.030303,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.030303,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.030303,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.030303,0.0,0.0,0.0,0.0,0.030303,0.0,0.0,0.030303,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.030303,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.090909,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.030303,0.030303,0.0,0.0,0.030303
1,20002,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.083333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.083333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.083333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.083333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.083333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.083333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.083333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.083333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.083333,0.0,0.0,0.0,0.0,0.083333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.083333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.083333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,20003,0.0,0.016129,0.0,0.0,0.0,0.032258,0.0,0.0,0.016129,0.0,0.0,0.0,0.032258,0.0,0.032258,0.0,0.0,0.0,0.016129,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.016129,0.0,0.0,0.0,0.0,0.032258,0.0,0.0,0.0,0.016129,0.032258,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.016129,0.016129,0.016129,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.016129,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.016129,0.0,0.016129,0.0,0.016129,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.016129,0.0,0.032258,0.048387,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.032258,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.016129,0.0,0.0,0.0,0.0,0.016129,0.0,0.0,0.016129,0.016129,0.0,0.032258,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.016129,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.032258,0.016129,0.0,0.064516,0.0,0.032258,0.016129,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.016129,0.0,0.0,0.0,0.0,0.0,0.032258,0.0,0.0,0.0,0.016129,0.0,0.0,0.0,0.0,0.0,0.0,0.016129,0.0,0.0,0.0,0.0,0.048387,0.0,0.0,0.0,0.0,0.016129,0.0,0.016129,0.0,0.016129,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.016129,0.016129,0.0,0.0,0.0,0.016129,0.016129,0.0,0.0
3,20004,0.0,0.03,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.04,0.0,0.01,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.04,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.01,0.03,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.01,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.06,0.0,0.0,0.05,0.02,0.0,0.0,0.01,0.01,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.01,0.0,0.0,0.0,0.02,0.0,0.01,0.0,0.01,0.0,0.0,0.0,0.01,0.0,0.0,0.02,0.0,0.0,0.0,0.01,0.0,0.01,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.01,0.0,0.02,0.0,0.0,0.03,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.02,0.01,0.03,0.01,0.05,0.0,0.02,0.0,0.0,0.0,0.0,0.01,0.01,0.0,0.0,0.0,0.0,0.01,0.02,0.0,0.0,0.0,0.0,0.03,0.0,0.0,0.0,0.0,0.02,0.0,0.01,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.01,0.0,0.0
4,20005,0.0,0.04,0.0,0.01,0.01,0.01,0.0,0.02,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.01,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.02,0.01,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.06,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.01,0.0,0.0,0.01,0.0,0.0,0.0,0.03,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.04,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.11,0.06,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.01,0.0,0.01,0.0,0.03,0.01,0.0,0.0,0.02,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.03,0.02,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.04,0.03,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.01,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.03,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [33]:
print(dal_venues_grouped.shape)

(50, 243)


In [34]:
# Following our process described above, we print each area along with the top 5 most common categories of venues
num_top_venues = 5

for area in dal_venues_grouped['PostalCode']:
    print("----- {} -----".format(area))
    #print("----"+area+"----")
    temp = dal_venues_grouped[dal_venues_grouped['PostalCode'] == area].T.reset_index()
    temp.columns = ['venue cateory','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----- 20001 -----
     venue cateory  freq
0     Liquor Store  0.09
1  Thai Restaurant  0.09
2        BBQ Joint  0.06
3      Yoga Studio  0.03
4      Pizza Place  0.03


----- 20002 -----
   venue cateory  freq
0          Diner  0.08
1  Moving Target  0.08
2            Gym  0.08
3       Pharmacy  0.08
4            Bar  0.08


----- 20003 -----
          venue cateory  freq
0           Pizza Place  0.06
1  Gym / Fitness Center  0.05
2                   Spa  0.05
3                   Bar  0.03
4    Italian Restaurant  0.03


----- 20004 -----
    venue cateory  freq
0  History Museum  0.06
1           Hotel  0.05
2  Science Museum  0.05
3     Coffee Shop  0.04
4          Bakery  0.04


----- 20005 -----
         venue cateory  freq
0                Hotel  0.11
1          Coffee Shop  0.06
2            Hotel Bar  0.06
3  American Restaurant  0.04
4   Salon / Barbershop  0.04


----- 20006 -----
         venue cateory  freq
0          Coffee Shop  0.11
1       Sandwich Place  0.10
2        

In [35]:
# Create a pandas dataframe and display the top 10 categories of venues for each zip code.
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['PostalCode']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # beyond 'st', 'nd', 'rd' every rank up to 10 takes the 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
postalcode_area_venues_sorted = pd.DataFrame(columns=columns)
postalcode_area_venues_sorted['PostalCode'] = dal_venues_grouped['PostalCode']

for ind in np.arange(dal_venues_grouped.shape[0]):
    postalcode_area_venues_sorted.iloc[ind, 1:] = return_most_common_venues(dal_venues_grouped.iloc[ind, :], num_top_venues)


postalcode_area_venues_sorted.head()

Unnamed: 0,PostalCode,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,20001,Thai Restaurant,Liquor Store,BBQ Joint,Yoga Studio,Dog Run,Café,Cocktail Bar,Coffee Shop,Convenience Store,Pizza Place
1,20002,Indie Theater,Sandwich Place,Diner,Bus Line,New American Restaurant,Food & Drink Shop,Gym,Liquor Store,Bar,Moving Target
2,20003,Pizza Place,Spa,Gym / Fitness Center,Mobile Phone Shop,Italian Restaurant,Sandwich Place,Chinese Restaurant,Bar,Bakery,Coffee Shop
3,20004,History Museum,Hotel,Science Museum,Coffee Shop,Bakery,American Restaurant,Sandwich Place,Plaza,Exhibit,Steakhouse
4,20005,Hotel,Hotel Bar,Coffee Shop,American Restaurant,Food Truck,Salon / Barbershop,Latin American Restaurant,Deli / Bodega,Sandwich Place,Sushi Restaurant
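
Many categories within a zip code share exactly the same frequency (every venue category present in 20002 has frequency 0.083333), so their order in a top-N list depends on how the sort happens to break ties. The following is a minimal sketch of a deterministic ranking, with ties broken alphabetically. The helper `top_categories` and the toy row are illustrative assumptions and are not part of the notebook:

```python
def top_categories(freqs, n):
    """Return the n most frequent categories; ties are broken alphabetically."""
    # Sort by descending frequency first, then by category name.
    ranked = sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:n]]

# Toy row (made-up values): four categories tied at the same frequency.
row = {"Diner": 0.083333, "Bar": 0.083333, "Gym": 0.083333,
       "Zoo": 0.0, "Pharmacy": 0.083333}

top3 = top_categories(row, 3)
print(top3)  # ['Bar', 'Diner', 'Gym']
```

With a rule like this, two runs over the same data always list tied categories in the same order.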


**K-Means Clustering**

The dataset is now ready for k-means clustering. We start with 5 clusters, as we did in the New York and Toronto labs.

With 50 zip codes in the grouped data, 5 clusters is a reasonable starting point for classification. As a reminder, our overall objective is to identify the best cluster, which will then serve as a list of zip codes where our stakeholders can begin looking for a site for the new Russian restaurant. By reviewing the clustering results, we will assess how similar the zip codes are within each cluster and how dissimilar the clusters are from one another. This is a key phase of the work with the client, since both sides need a shared understanding of the results. If we need more clarity or more robust results, we can run k-means for other values of k and explore further.
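
One common way to put a number on "similar within a cluster, dissimilar between clusters" is the silhouette coefficient. The sketch below computes it by hand on made-up 2-D points so that it runs without any dependencies; it is an illustration under those assumptions, not the method used in this notebook. With the notebook's data, one would more likely call `sklearn.metrics.silhouette_score` on the clustering frame and `kmeans.labels_`.

```python
from math import dist  # Python 3.8+

def silhouette(points, labels):
    """Mean silhouette coefficient; single-point clusters score 0 by convention."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the point's own cluster
        # (dist(p, p) is 0, so summing over the whole cluster is harmless)
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(sum(dist(p, q) for q in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Toy data (made up): two well-separated groups of three points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good = silhouette(points, [0, 0, 0, 1, 1, 1])  # labels follow the groups
bad = silhouette(points, [0, 1, 0, 1, 0, 1])   # labels mix the groups
print(round(good, 3), round(bad, 3))
```

Values close to 1 indicate compact, well-separated clusters; values near 0 or below indicate overlapping or badly assigned points.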

In [36]:
# set number of clusters
kclusters = 5

dal_venues_grouped_clustering = dal_venues_grouped.drop(columns='PostalCode')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(dal_venues_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1])

In [37]:
# Add clustering labels  
postalcode_area_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

dal_merged = zcode

# merge data frames to add latitude/longitude for each zip code
dal_merged = dal_merged.join(postalcode_area_venues_sorted.set_index('PostalCode'), on='zip', how='inner')

dal_merged.head(20)

Unnamed: 0,zip,lat,lng,city,state_id,state_name,county_name,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
6213,20001,38.9108,-77.0178,Washington,DC,District of Columbia,District of Columbia,0,Thai Restaurant,Liquor Store,BBQ Joint,Yoga Studio,Dog Run,Café,Cocktail Bar,Coffee Shop,Convenience Store,Pizza Place
6214,20002,38.9051,-76.9843,Washington,DC,District of Columbia,District of Columbia,0,Indie Theater,Sandwich Place,Diner,Bus Line,New American Restaurant,Food & Drink Shop,Gym,Liquor Store,Bar,Moving Target
6215,20003,38.8812,-76.9906,Washington,DC,District of Columbia,District of Columbia,0,Pizza Place,Spa,Gym / Fitness Center,Mobile Phone Shop,Italian Restaurant,Sandwich Place,Chinese Restaurant,Bar,Bakery,Coffee Shop
6216,20004,38.8949,-77.0287,Washington,DC,District of Columbia,District of Columbia,0,History Museum,Hotel,Science Museum,Coffee Shop,Bakery,American Restaurant,Sandwich Place,Plaza,Exhibit,Steakhouse
6217,20005,38.9047,-77.0315,Washington,DC,District of Columbia,District of Columbia,0,Hotel,Hotel Bar,Coffee Shop,American Restaurant,Food Truck,Salon / Barbershop,Latin American Restaurant,Deli / Bodega,Sandwich Place,Sushi Restaurant
6218,20006,38.8986,-77.0413,Washington,DC,District of Columbia,District of Columbia,0,Coffee Shop,Sandwich Place,Hotel,Café,American Restaurant,History Museum,Park,Bakery,Mexican Restaurant,Indian Restaurant
6219,20007,38.9141,-77.0787,Washington,DC,District of Columbia,District of Columbia,0,Trail,Fast Food Restaurant,Bus Station,Park,Dog Run,Bagel Shop,Food Truck,Food & Drink Shop,Food Court,Fountain
6220,20008,38.9359,-77.0593,Washington,DC,District of Columbia,District of Columbia,0,Thai Restaurant,Mexican Restaurant,Italian Restaurant,Mediterranean Restaurant,Supplement Shop,Steakhouse,Sports Bar,Gift Shop,Liquor Store,Café
6221,20009,38.9199,-77.0375,Washington,DC,District of Columbia,District of Columbia,0,Bar,Pizza Place,Cosmetics Shop,Jazz Club,Restaurant,Taco Place,Diner,Ethiopian Restaurant,Bakery,Latin American Restaurant
6222,20010,38.9324,-77.03,Washington,DC,District of Columbia,District of Columbia,0,Bar,Vietnamese Restaurant,Bakery,Pizza Place,Asian Restaurant,Yoga Studio,Caribbean Restaurant,Breakfast Spot,Mediterranean Restaurant,Soccer Field


In [38]:
# Recreate the map to show the zip codes with their assigned clusters
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=10)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lati, long, city, poi, cluster in zip(dal_merged['lat'], dal_merged['lng'], dal_merged['city'], dal_merged['zip'], dal_merged['Cluster Labels']):
    label = folium.Popup(str(city) + ', ' + str(poi) + ', Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lati, long],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.99).add_to(map_clusters)
       
map_clusters

In [39]:
dal_merged.head(5)

Unnamed: 0,zip,lat,lng,city,state_id,state_name,county_name,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
6213,20001,38.9108,-77.0178,Washington,DC,District of Columbia,District of Columbia,0,Thai Restaurant,Liquor Store,BBQ Joint,Yoga Studio,Dog Run,Café,Cocktail Bar,Coffee Shop,Convenience Store,Pizza Place
6214,20002,38.9051,-76.9843,Washington,DC,District of Columbia,District of Columbia,0,Indie Theater,Sandwich Place,Diner,Bus Line,New American Restaurant,Food & Drink Shop,Gym,Liquor Store,Bar,Moving Target
6215,20003,38.8812,-76.9906,Washington,DC,District of Columbia,District of Columbia,0,Pizza Place,Spa,Gym / Fitness Center,Mobile Phone Shop,Italian Restaurant,Sandwich Place,Chinese Restaurant,Bar,Bakery,Coffee Shop
6216,20004,38.8949,-77.0287,Washington,DC,District of Columbia,District of Columbia,0,History Museum,Hotel,Science Museum,Coffee Shop,Bakery,American Restaurant,Sandwich Place,Plaza,Exhibit,Steakhouse
6217,20005,38.9047,-77.0315,Washington,DC,District of Columbia,District of Columbia,0,Hotel,Hotel Bar,Coffee Shop,American Restaurant,Food Truck,Salon / Barbershop,Latin American Restaurant,Deli / Bodega,Sandwich Place,Sushi Restaurant


The majority of the points fall in Cluster 0, so we need further analysis to determine whether this cluster is acceptable or needs to be broken down further. We can still review each cluster and build the functions we'll use later. We can also look at each cluster in more detail to see how similar the points within it are and how different the clusters are from one another.
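
To make the imbalance concrete, the label array printed above can be tallied directly. This is a small sketch using only the standard library; the `labels` list reproduces the k=5 `kmeans.labels_` output shown earlier, and the variable names are illustrative:

```python
from collections import Counter

# Labels copied from the k=5 `kmeans.labels_` output shown earlier (50 zip codes).
labels = [0] * 11 + [2, 3] + [0] * 20 + [4, 1] + [0] * 14 + [1]

sizes = Counter(labels)
largest_label, largest_size = sizes.most_common(1)[0]
share = largest_size / len(labels)

print(dict(sorted(sizes.items())))  # {0: 45, 1: 2, 2: 1, 3: 1, 4: 1}
print(f"cluster {largest_label} holds {share:.0%} of the zip codes")
```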

**Examining each Cluster**

Below we examine each of the 5 clusters to see which venue categories distinguish one cluster from the next, along with the most common venues for each zip code in the cluster.

**Cluster 0**

In [40]:
dal_merged.loc[dal_merged['Cluster Labels'] == 0, dal_merged.columns[list(range(0,1)) + [6,7] + list(range(9, dal_merged.shape[1]))]]

Unnamed: 0,zip,county_name,Cluster Labels,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
6213,20001,District of Columbia,0,Liquor Store,BBQ Joint,Yoga Studio,Dog Run,Café,Cocktail Bar,Coffee Shop,Convenience Store,Pizza Place
6214,20002,District of Columbia,0,Sandwich Place,Diner,Bus Line,New American Restaurant,Food & Drink Shop,Gym,Liquor Store,Bar,Moving Target
6215,20003,District of Columbia,0,Spa,Gym / Fitness Center,Mobile Phone Shop,Italian Restaurant,Sandwich Place,Chinese Restaurant,Bar,Bakery,Coffee Shop
6216,20004,District of Columbia,0,Hotel,Science Museum,Coffee Shop,Bakery,American Restaurant,Sandwich Place,Plaza,Exhibit,Steakhouse
6217,20005,District of Columbia,0,Hotel Bar,Coffee Shop,American Restaurant,Food Truck,Salon / Barbershop,Latin American Restaurant,Deli / Bodega,Sandwich Place,Sushi Restaurant
6218,20006,District of Columbia,0,Sandwich Place,Hotel,Café,American Restaurant,History Museum,Park,Bakery,Mexican Restaurant,Indian Restaurant
6219,20007,District of Columbia,0,Fast Food Restaurant,Bus Station,Park,Dog Run,Bagel Shop,Food Truck,Food & Drink Shop,Food Court,Fountain
6220,20008,District of Columbia,0,Mexican Restaurant,Italian Restaurant,Mediterranean Restaurant,Supplement Shop,Steakhouse,Sports Bar,Gift Shop,Liquor Store,Café
6221,20009,District of Columbia,0,Pizza Place,Cosmetics Shop,Jazz Club,Restaurant,Taco Place,Diner,Ethiopian Restaurant,Bakery,Latin American Restaurant
6222,20010,District of Columbia,0,Vietnamese Restaurant,Bakery,Pizza Place,Asian Restaurant,Yoga Studio,Caribbean Restaurant,Breakfast Spot,Mediterranean Restaurant,Soccer Field


**Cluster 1**

In [41]:
dal_merged.loc[dal_merged['Cluster Labels'] == 1, dal_merged.columns[list(range(0,1)) + [6,7] + list(range(9, dal_merged.shape[1]))]]

Unnamed: 0,zip,county_name,Cluster Labels,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
6295,20319,District of Columbia,1,Boat or Ferry,Filipino Restaurant,Garden,Furniture / Home Store,Frozen Yogurt Shop,Fried Chicken Joint,French Restaurant,Fountain,Food Truck
6311,20593,District of Columbia,1,River,Sporting Goods Shop,Brewery,Heliport,Gay Bar,Boat or Ferry,Pizza Place,Soccer Stadium,Indian Restaurant


**Cluster 2**

In [42]:
dal_merged.loc[dal_merged['Cluster Labels'] == 2, dal_merged.columns[list(range(0,1)) + [6,7] + list(range(9, dal_merged.shape[1]))]]

Unnamed: 0,zip,county_name,Cluster Labels,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
6224,20012,District of Columbia,2,Dive Bar,Bus Station,Gym,Fast Food Restaurant,Yoga Studio,Fountain,Food Court,Food Truck,Fried Chicken Joint


**Cluster 3**

In [43]:
dal_merged.loc[dal_merged['Cluster Labels'] == 3, dal_merged.columns[list(range(0,1)) + [6,7] + list(range(9, dal_merged.shape[1]))]]

Unnamed: 0,zip,county_name,Cluster Labels,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
6225,20015,District of Columbia,3,Farmers Market,Yoga Studio,Filipino Restaurant,Garden,Furniture / Home Store,Frozen Yogurt Shop,Fried Chicken Joint,French Restaurant,Fountain


**Cluster 4**

In [44]:
dal_merged.loc[dal_merged['Cluster Labels'] == 4, dal_merged.columns[list(range(0,1)) + [6,7] + list(range(9, dal_merged.shape[1]))]]

Unnamed: 0,zip,county_name,Cluster Labels,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
6294,20317,District of Columbia,4,Deli / Bodega,Golf Course,Gastropub,Gas Station,Garden,Furniture / Home Store,Frozen Yogurt Shop,Fried Chicken Joint,French Restaurant


It is possible that 5 clusters are too few to split the data meaningfully. We set the number of clusters to 10 and 20 to see whether more clusters give a more even split of the zip codes.

In [45]:
# set number of clusters
kclusters = 10
dal_venues_grouped_clustering = dal_venues_grouped.drop(columns='PostalCode')

# run k-means clustering
kmeans_k10 = KMeans(n_clusters=kclusters, random_state=0).fit(dal_venues_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans_k10.labels_

array([1, 1, 1, 1, 1, 7, 6, 1, 1, 1, 1, 8, 2, 7, 0, 1, 1, 1, 9, 1, 1, 1,
       7, 4, 7, 7, 4, 4, 4, 1, 7, 7, 1, 3, 5, 1, 7, 7, 7, 7, 1, 7, 1, 1,
       7, 4, 4, 4, 1, 1])

In [46]:
kmeans_k10.labels_[4]

1

In [47]:
a=kmeans_k10.labels_
unique, counts = np.unique(a, return_counts=True)
dict(zip(unique, counts))

{0: 1, 1: 23, 2: 1, 3: 1, 4: 7, 5: 1, 6: 1, 7: 13, 8: 1, 9: 1}

In [48]:
# set number of clusters equal to 20
kclusters = 20
dal_venues_grouped_clustering = dal_venues_grouped.drop(columns='PostalCode')

# run k-means clustering
kmeans_k20 = KMeans(n_clusters=kclusters, random_state=0).fit(dal_venues_grouped_clustering)

# check cluster labels generated for each row in the dataframe
# kmeans_k20.labels_
unique, counts = np.unique(kmeans_k20.labels_, return_counts=True)
#pd.DataFrame(dict(zip(unique, counts)))

In [49]:
dict(zip(unique, counts))

{0: 1,
 1: 1,
 2: 1,
 3: 16,
 4: 1,
 5: 1,
 6: 1,
 7: 1,
 8: 9,
 9: 1,
 10: 1,
 11: 4,
 12: 1,
 13: 1,
 14: 3,
 15: 2,
 16: 2,
 17: 1,
 18: 1,
 19: 1}

In [50]:
unique, counts = np.unique(kmeans_k20.labels_, return_counts=True)

In [51]:
dal_venues_grouped_clustering = dal_venues_grouped.drop(columns='PostalCode')

# fit k-means for every k from 2 to 20 and report the cluster sizes
for i in range(2, 21):
    # set number of clusters
    kclusters = i

    # run k-means clustering
    kmeans_ki = KMeans(n_clusters=kclusters, random_state=0).fit(dal_venues_grouped_clustering)

    # count the zip codes assigned to each cluster
    unique, counts = np.unique(kmeans_ki.labels_, return_counts=True)
    print("# of points per cluster for {} clusters: ".format(i))
    print(dict(zip(unique, counts)))

# of points per cluster for 2 clusters: 
{0: 49, 1: 1}
# of points per cluster for 3 clusters: 
{0: 1, 1: 1, 2: 48}
# of points per cluster for 4 clusters: 
{0: 5, 1: 43, 2: 1, 3: 1}
# of points per cluster for 5 clusters: 
{0: 45, 1: 2, 2: 1, 3: 1, 4: 1}
# of points per cluster for 6 clusters: 
{0: 1, 1: 21, 2: 1, 3: 25, 4: 1, 5: 1}
# of points per cluster for 7 clusters: 
{0: 1, 1: 25, 2: 20, 3: 1, 4: 1, 5: 1, 6: 1}
# of points per cluster for 8 clusters: 
{0: 7, 1: 1, 2: 1, 3: 1, 4: 1, 5: 24, 6: 1, 7: 14}
# of points per cluster for 9 clusters: 
{0: 6, 1: 27, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 11}
# of points per cluster for 10 clusters: 
{0: 1, 1: 23, 2: 1, 3: 1, 4: 7, 5: 1, 6: 1, 7: 13, 8: 1, 9: 1}
# of points per cluster for 11 clusters: 
{0: 13, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 14, 9: 6, 10: 10}
# of points per cluster for 12 clusters: 
{0: 6, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 9, 7: 6, 8: 1, 9: 1, 10: 1, 11: 21}
# of points per cluster for 13 clusters: 
{0: 1, 1: 

With the further analysis done above, we have determined the number of data points in each cluster for cluster counts ranging from 2 to 20.

Based on the list generated above, k=7 appears to be the best number of clusters to use going forward. With the number of points per cluster being {0: 1, 1: 25, 2: 20, 3: 1, 4: 1, 5: 1, 6: 1}, we can re-run our k-means clustering and display it on the map as we did above.
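
The criterion behind "best" is not spelled out, so one hedged way to compare values of k is to summarize each split by the share of zip codes in its largest cluster and by the number of single-point clusters. The sketch below does this for a few of the cluster-size lists printed above; the `balance` helper and the choice of these two indicators are assumptions for illustration, not the notebook's own procedure:

```python
# Cluster sizes copied from the loop output above for a few values of k.
sizes_by_k = {
    5: [45, 2, 1, 1, 1],
    6: [1, 21, 1, 25, 1, 1],
    7: [1, 25, 20, 1, 1, 1, 1],
    8: [7, 1, 1, 1, 1, 24, 1, 14],
    10: [1, 23, 1, 1, 7, 1, 1, 13, 1, 1],
}

def balance(sizes):
    """Return (share of the largest cluster, number of single-point clusters)."""
    total = sum(sizes)
    return max(sizes) / total, sum(1 for s in sizes if s == 1)

summary = {k: balance(sizes) for k, sizes in sizes_by_k.items()}
for k, (share, singletons) in summary.items():
    print(f"k={k:2d}  largest cluster: {share:.0%}  singletons: {singletons}")
```

On these figures the largest cluster drops from 90% of the zip codes at k=5 to 50% at k=7, while five of the seven clusters at k=7 hold a single zip code.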

In [52]:
kclusters=7
# Recreate the map with k=7 clusters
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=10)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lati, long, city, poi, cluster in zip(dal_merged['lat'], dal_merged['lng'], dal_merged['city'], dal_merged['zip'], dal_merged['Cluster Labels']):
    label = folium.Popup(str(city) + ', ' + str(poi) + ', Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lati, long],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.99).add_to(map_clusters)
       
map_clusters
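
The map cell above changes `kclusters` but reads `dal_merged['Cluster Labels']`, which was filled from the k=5 fit. Below is a minimal sketch of refitting and rebuilding the label column for a chosen k. The helper `label_zip_codes` and the toy frames are assumptions for illustration, standing in for `dal_venues_grouped` and `zcode`:

```python
import pandas as pd
from sklearn.cluster import KMeans

# Toy stand-ins for dal_venues_grouped and zcode (all values are made up).
grouped = pd.DataFrame({
    "PostalCode": [20001, 20002, 20003, 20004, 20005, 20006],
    "Bar":        [0.9, 0.8, 0.1, 0.0, 0.5, 0.5],
    "Museum":     [0.1, 0.2, 0.9, 1.0, 0.5, 0.5],
})
zips = pd.DataFrame({"zip": grouped["PostalCode"], "lat": 38.9, "lng": -77.0})

def label_zip_codes(grouped, zips, k):
    """Fit k-means with k clusters and return zips with a fresh label column."""
    features = grouped.drop(columns="PostalCode")
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(features)
    labels = pd.DataFrame({"PostalCode": grouped["PostalCode"],
                           "Cluster Labels": model.labels_})
    # Same join as in the notebook: attach labels to the zip code table.
    return zips.join(labels.set_index("PostalCode"), on="zip", how="inner")

merged_k2 = label_zip_codes(grouped, zips, k=2)
merged_k3 = label_zip_codes(grouped, zips, k=3)
print(merged_k3["Cluster Labels"].value_counts().to_dict())
```

Because each call builds its own merged frame, the map and the per-cluster tables always reflect the k that was actually fitted.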

As we can see, the map has changed. We now have 7 clusters to assign each zip code to, and we analyze each cluster below as we did before.

**Cluster 0**

In [53]:
dal_merged.loc[dal_merged['Cluster Labels'] == 0, dal_merged.columns[list(range(0,1)) + [6,7] + list(range(9, dal_merged.shape[1]))]]

Unnamed: 0,zip,county_name,Cluster Labels,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
6213,20001,District of Columbia,0,Liquor Store,BBQ Joint,Yoga Studio,Dog Run,Café,Cocktail Bar,Coffee Shop,Convenience Store,Pizza Place
6214,20002,District of Columbia,0,Sandwich Place,Diner,Bus Line,New American Restaurant,Food & Drink Shop,Gym,Liquor Store,Bar,Moving Target
6215,20003,District of Columbia,0,Spa,Gym / Fitness Center,Mobile Phone Shop,Italian Restaurant,Sandwich Place,Chinese Restaurant,Bar,Bakery,Coffee Shop
6216,20004,District of Columbia,0,Hotel,Science Museum,Coffee Shop,Bakery,American Restaurant,Sandwich Place,Plaza,Exhibit,Steakhouse
6217,20005,District of Columbia,0,Hotel Bar,Coffee Shop,American Restaurant,Food Truck,Salon / Barbershop,Latin American Restaurant,Deli / Bodega,Sandwich Place,Sushi Restaurant
6218,20006,District of Columbia,0,Sandwich Place,Hotel,Café,American Restaurant,History Museum,Park,Bakery,Mexican Restaurant,Indian Restaurant
6219,20007,District of Columbia,0,Fast Food Restaurant,Bus Station,Park,Dog Run,Bagel Shop,Food Truck,Food & Drink Shop,Food Court,Fountain
6220,20008,District of Columbia,0,Mexican Restaurant,Italian Restaurant,Mediterranean Restaurant,Supplement Shop,Steakhouse,Sports Bar,Gift Shop,Liquor Store,Café
6221,20009,District of Columbia,0,Pizza Place,Cosmetics Shop,Jazz Club,Restaurant,Taco Place,Diner,Ethiopian Restaurant,Bakery,Latin American Restaurant
6222,20010,District of Columbia,0,Vietnamese Restaurant,Bakery,Pizza Place,Asian Restaurant,Yoga Studio,Caribbean Restaurant,Breakfast Spot,Mediterranean Restaurant,Soccer Field


**Cluster 1**

In [54]:
dal_merged.loc[dal_merged['Cluster Labels'] == 1, dal_merged.columns[list(range(0,1)) + [6,7] + list(range(9, dal_merged.shape[1]))]]

Unnamed: 0,zip,county_name,Cluster Labels,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
6295,20319,District of Columbia,1,Boat or Ferry,Filipino Restaurant,Garden,Furniture / Home Store,Frozen Yogurt Shop,Fried Chicken Joint,French Restaurant,Fountain,Food Truck
6311,20593,District of Columbia,1,River,Sporting Goods Shop,Brewery,Heliport,Gay Bar,Boat or Ferry,Pizza Place,Soccer Stadium,Indian Restaurant


**Cluster 2**

In [55]:
dal_merged.loc[dal_merged['Cluster Labels'] == 2, dal_merged.columns[list(range(0,1)) + [6,7] + list(range(9, dal_merged.shape[1]))]]

Unnamed: 0,zip,county_name,Cluster Labels,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
6224,20012,District of Columbia,2,Dive Bar,Bus Station,Gym,Fast Food Restaurant,Yoga Studio,Fountain,Food Court,Food Truck,Fried Chicken Joint


**Cluster 3**

In [56]:
dal_merged.loc[dal_merged['Cluster Labels'] == 3, dal_merged.columns[list(range(0,1)) + [6,7] + list(range(9, dal_merged.shape[1]))]]

Unnamed: 0,zip,county_name,Cluster Labels,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
6225,20015,District of Columbia,3,Farmers Market,Yoga Studio,Filipino Restaurant,Garden,Furniture / Home Store,Frozen Yogurt Shop,Fried Chicken Joint,French Restaurant,Fountain


**Cluster 4**

In [57]:
dal_merged.loc[dal_merged['Cluster Labels'] == 4, dal_merged.columns[list(range(0,1)) + [6,7] + list(range(9, dal_merged.shape[1]))]]

Unnamed: 0,zip,county_name,Cluster Labels,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
6294,20317,District of Columbia,4,Deli / Bodega,Golf Course,Gastropub,Gas Station,Garden,Furniture / Home Store,Frozen Yogurt Shop,Fried Chicken Joint,French Restaurant


**Cluster 5**

In [58]:
dal_merged.loc[dal_merged['Cluster Labels'] == 5, dal_merged.columns[list(range(0,1)) + [6,7] + list(range(9, dal_merged.shape[1]))]]

Unnamed: 0,zip,county_name,Cluster Labels,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue


**Cluster 6**

In [59]:
dal_merged.loc[dal_merged['Cluster Labels'] == 6, dal_merged.columns[list(range(0,1)) + [6,7] + list(range(9, dal_merged.shape[1]))]]

Unnamed: 0,zip,county_name,Cluster Labels,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue


**We choose Cluster 0 as our cluster for finding the new restaurant**

This reaffirms our choice of cluster 0, which will determine the exact zip code where the new Russian restaurant will go. After running k-means for many values of k (from k=2 to k=20), we found the most discriminating result at k=7. Note that the per-cluster counts in the tables above, {0: 45, 1: 2, 2: 1, 3: 1, 4: 1, 5: 0, 6: 0}, are those of the k=5 labels still stored in `dal_merged`; the k=7 fit itself produced {0: 1, 1: 25, 2: 20, 3: 1, 4: 1, 5: 1, 6: 1}. We'll explore the results in the Results section of this report. At this stage we realized that the number of clusters was crucial to how well k-means could separate the zip codes. We also saw that we had to rework the features extracted with the Foursquare API to make them more relevant to our problem. It is thanks both to trying many values of k and to these feature modifications that we reached the results presented in the next section of our report.