# IBM Data Science Professional Certificate Capstone

This notebook contains the neighborhood analysis project for the data science capstone course on Coursera.

## Introduction
Gainesville, FL is Florida's 15th largest city, with 134,945 people within the city and 295,266 in the overall area. It grew 8% between 2010 and 2020. As it continues to grow, there should be opportunities for new restaurants, including new bakeries.

To identify the best potential areas for our new bakery, I will analyze different parts of the city to determine which areas would have the least competition. The challenge is that Gainesville's population is spread across several dense areas and many less dense regions.

As such, I want to identify areas that have low competition for a bakery but also have enough people in the area to support a new establishment.

You can read more about this analysis in [my full report](https://github.com/fpcorso/Coursera_Capstone/blob/master/ibm-data-science-capstone-report.pdf).

## Part 1 - Identifying our potential locations

In [3]:
# Our needed imports.
#!conda install -c conda-forge folium --yes
!conda install -c conda-forge geopy --yes
import folium
import ibm_boto3
import json
import math
import matplotlib.cm as cm
import matplotlib.colors as colors
import numpy as np
import pandas as pd
import requests
import types
from botocore.client import Config
from geopy.geocoders import GoogleV3
from IPython.display import Image 
from sklearn.cluster import KMeans


In [4]:
# Create our corners of Gainesville.
gainesville_north = 29.711381
gainesville_south = 29.596737
gainesville_west = -82.453961
gainesville_east = -82.262119

In [5]:
# Define how many rows and columns we want to create for potential locations.
LOCATION_ROWS = 11
LOCATION_COLUMNS = 16
GAINESVILLE_LATITUDE = 29.662737
GAINESVILLE_LONGITUDE = -82.370212

In [6]:
# Calculate how big each location is.
lat_diff = gainesville_north - gainesville_south
long_diff = gainesville_west - gainesville_east
lat_segment = lat_diff / (LOCATION_ROWS)
long_segment = long_diff / (LOCATION_COLUMNS)

In [7]:
# Generate the center for all locations.
location_rows = []
north_boundary = gainesville_north
for row in range(LOCATION_ROWS):
    south_boundary = north_boundary - lat_segment
    row_center = (north_boundary + south_boundary) / 2
    west_boundary = gainesville_west
    for column in range(LOCATION_COLUMNS):
        # long_segment is negative (west minus east), so subtracting it moves east.
        east_boundary = west_boundary - long_segment
        column_center = (east_boundary + west_boundary) / 2
        west_boundary = east_boundary
        location_rows.append(['{}-{}'.format(row, column), row_center, column_center])
    north_boundary = south_boundary

# Build the dataframe in one step instead of appending row by row.
gainesville_locations = pd.DataFrame(location_rows, columns=['Location', 'Lat', 'Long'])
gainesville_locations.head()

| | Location | Lat | Long |
|---|---|---|---|
| 0 | 0-0 | 29.70617 | -82.447966 |
| 1 | 0-1 | 29.70617 | -82.435976 |
| 2 | 0-2 | 29.70617 | -82.423986 |
| 3 | 0-3 | 29.70617 | -82.411996 |
| 4 | 0-4 | 29.70617 | -82.400005 |
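
The nested loop above walks the bounding box cell by cell. As a cross-check, the same centers can be computed directly from the row and column indices. The sketch below is a standalone illustration rather than the notebook's own code: `grid_centers` is a hypothetical helper name, and the bounds and grid size are copied from the cells above.

```python
def grid_centers(north, south, west, east, rows, columns):
    """Return (label, lat, long) for the center of each grid cell.

    Rows are numbered from north to south and columns from west to east,
    matching the 'row-column' labels used in the notebook.
    """
    lat_step = (north - south) / rows
    long_step = (east - west) / columns
    centers = []
    for row in range(rows):
        # Center of the row: half a step below the row's northern edge.
        lat = north - (row + 0.5) * lat_step
        for column in range(columns):
            # Center of the column: half a step east of its western edge.
            long = west + (column + 0.5) * long_step
            centers.append(('{}-{}'.format(row, column), lat, long))
    return centers


centers = grid_centers(29.711381, 29.596737, -82.453961, -82.262119, 11, 16)
print(len(centers))
print(centers[0])
```

Computing each center from its index avoids carrying running boundaries through the loop, which makes the sign of the longitude step easier to follow.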


In [8]:
# Count our potential locations
LOCATION_COUNT = len(gainesville_locations)
LOCATION_COUNT

176

In [10]:
# Function for calculating distance
def calculate_distance(from_lat, from_long, to_lat, to_long):
    """Calculates distance in meters between two points."""
    from_coords = (from_lat, from_long)
    to_coords = (to_lat, to_long)
    
    # Convert to radians
    from_radians = [math.radians(coord) for coord in from_coords]
    to_radians = [math.radians(coord) for coord in to_coords]
    delta_longitudes = to_radians[1] - from_radians[1]
    
    # Calculate the central angle (in radians) using the Haversine formula
    angle_radians = 2 * math.asin(
        math.sqrt(
            math.pow(math.sin((to_radians[0] - from_radians[0])/2), 2) +
            math.cos(from_radians[0]) * math.cos(to_radians[0]) * math.pow(math.sin(delta_longitudes/2), 2)
        )
    )
    
    # Multiply the angle by Earth's radius in meters to get the distance
    return angle_radians * 6372795
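
A quick way to sanity-check a Haversine implementation is to measure one degree of latitude along a meridian, which should come out near 111 km. The sketch below restates the formula in a standalone function (`haversine_m` is a name chosen here for illustration) using the same Earth radius as the function above.

```python
import math

EARTH_RADIUS_M = 6372795  # same radius constant as the notebook's function


def haversine_m(from_lat, from_long, to_lat, to_long):
    """Great-circle distance in meters between two (lat, long) points given in degrees."""
    phi1, phi2 = math.radians(from_lat), math.radians(to_lat)
    d_phi = phi2 - phi1
    d_lambda = math.radians(to_long - from_long)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    # The central angle is in radians; multiplying by the radius gives meters.
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


one_degree = haversine_m(0.0, 0.0, 1.0, 0.0)
print(round(one_degree))
```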

In [11]:
# Calculate radius of each location

# The spacing between the centers of two adjacent locations is the diameter (already in meters)
first = gainesville_locations.iloc[0]
second = gainesville_locations.iloc[1]
diameter = calculate_distance(first['Lat'], first['Long'], second['Lat'], second['Long'])

# Divide the diameter by 2 to get the radius
LOCATION_RADIUS = round(diameter / 2)
print('{} meters'.format(LOCATION_RADIUS))

579 meters
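
The 579 meter figure can be cross-checked without the Haversine formula. For two nearby points on the same latitude, the east-west distance is approximately the Earth radius times the cosine of the latitude times the longitude difference in radians. The sketch below applies that approximation to the grid values defined earlier; it is an independent check under that small-distance assumption, not part of the original analysis.

```python
import math

EARTH_RADIUS_M = 6372795
west, east, columns = -82.453961, -82.262119, 16
north, south, rows = 29.711381, 29.596737, 11

# Width of one grid column in degrees of longitude.
long_step_deg = (east - west) / columns
# Latitude of the first (northernmost) row of centers.
first_row_lat = north - (north - south) / rows / 2

# Small-distance approximation for an east-west segment at a fixed latitude.
cell_width_m = EARTH_RADIUS_M * math.cos(math.radians(first_row_lat)) * math.radians(long_step_deg)
approx_radius = round(cell_width_m / 2)
print('{} meters'.format(approx_radius))
```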


In [12]:
# Create map of Gainesville to see our locations.
general_map = folium.Map(location=[GAINESVILLE_LATITUDE, GAINESVILLE_LONGITUDE], zoom_start=12)

# Add a marker to the map for each location.
for index, row in gainesville_locations.iterrows():
    folium.CircleMarker(
        [row['Lat'], row['Long']],
        radius=17,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(general_map)  
    
general_map

## Part 2 - Getting location median home values

In [13]:
# Set the default median value for locations not in a neighborhood
GAINESVILLE_MEDIAN_HOME_VALUE = 207069

### Load our home values per neighborhood data from [Zillow Research](https://www.zillow.com/research/data/)
The data was uploaded to the IBM Cloud, so the actual loading code is hidden.

In [14]:
# The code was removed by Watson Studio for sharing.

Set our Google Maps API key as a constant.

In [15]:
# The code was removed by Watson Studio for sharing.

In [16]:
# Check what our home values dataframe looks like
print(neighborhood_home_values.shape)
neighborhood_home_values.head()

(16135, 302)


Unnamed: 0,RegionID,SizeRank,RegionName,RegionType,StateName,State,City,Metro,CountyName,1996-01-31,...,2019-08-31,2019-09-30,2019-10-31,2019-11-30,2019-12-31,2020-01-31,2020-02-29,2020-03-31,2020-04-30,2020-05-31
0,274772,0,Northeast Dallas,Neighborhood,TX,TX,Dallas,Dallas-Fort Worth-Arlington,Dallas County,134205.0,...,328875.0,330561.0,331099.0,332078.0,331280.0,330929.0,330317.0,330090.0,330921.0,332149.0
1,112345,1,Maryvale,Neighborhood,AZ,AZ,Phoenix,Phoenix-Mesa-Scottsdale,Maricopa County,,...,186433.0,187736.0,188862.0,189907.0,191307.0,193237.0,195086.0,197237.0,199661.0,201983.0
2,192689,2,Paradise,Neighborhood,NV,NV,Las Vegas,Las Vegas-Henderson-Paradise,Clark County,139879.0,...,267685.0,267385.0,267797.0,268742.0,269426.0,270436.0,271111.0,272900.0,274356.0,275110.0
3,270958,3,Upper West Side,Neighborhood,NY,NY,New York,New York-Newark-Jersey City,New York County,248080.0,...,1237440.0,1224437.0,1218007.0,1220612.0,1228102.0,1226980.0,1220267.0,1207472.0,1205337.0,1202352.0
4,118208,4,South Los Angeles,Neighborhood,CA,CA,Los Angeles,Los Angeles-Long Beach-Anaheim,Los Angeles County,135698.0,...,514803.0,517905.0,521305.0,524199.0,527628.0,531100.0,535542.0,539881.0,543171.0,544092.0


In [17]:
# Get just Gainesville home values
gainesville_home_values = neighborhood_home_values[(neighborhood_home_values['City'] == 'Gainesville') & (neighborhood_home_values['State'] == 'FL')]
print(gainesville_home_values.shape)
gainesville_home_values.head()

(48, 302)


Unnamed: 0,RegionID,SizeRank,RegionName,RegionType,StateName,State,City,Metro,CountyName,1996-01-31,...,2019-08-31,2019-09-30,2019-10-31,2019-11-30,2019-12-31,2020-01-31,2020-02-29,2020-03-31,2020-04-30,2020-05-31
3963,266011,4089,University Park,Neighborhood,FL,FL,Gainesville,Gainesville,Alachua County,124755.0,...,299220.0,301109.0,302805.0,302588.0,302788.0,303843.0,305473.0,306169.0,305674.0,306186.0
5500,254905,5680,Ashton,Neighborhood,FL,FL,Gainesville,Gainesville,Alachua County,154044.0,...,326901.0,327430.0,328306.0,328586.0,328520.0,329184.0,332157.0,335108.0,337356.0,338884.0
5577,761454,5761,Capri,Neighborhood,FL,FL,Gainesville,Gainesville,Alachua County,121154.0,...,228901.0,230508.0,232351.0,232715.0,232695.0,233082.0,235105.0,236986.0,238151.0,239360.0
7648,761433,7943,Gateway Park,Neighborhood,FL,FL,Gainesville,Gainesville,Alachua County,71320.0,...,137142.0,138523.0,139308.0,140332.0,141023.0,142507.0,144413.0,146245.0,147269.0,148293.0
8043,761431,8359,Highland Court Manor,Neighborhood,FL,FL,Gainesville,Gainesville,Alachua County,57558.0,...,126898.0,128794.0,130385.0,131719.0,132665.0,134166.0,136753.0,139152.0,141438.0,143722.0


In [18]:
# Creates our simplified neighborhood dataframe
gainesville_neighborhoods = pd.DataFrame(columns=['Neighborhood', 'Home Value', 'Lat', 'Long'])

In [20]:
# Cycles through our home value list
geolocator = GoogleV3(api_key=GOOGLE_MAPS_API_KEY)
neighborhood_rows = []
for index, row in gainesville_home_values.iterrows():
    # Gets latitude and longitude from Google Maps
    location = geolocator.geocode('{}, Gainesville, FL'.format(row['RegionName']))
    
    # Skips neighborhoods that Google Maps could not geocode
    if location is None:
        continue
    
    # Collects our neighborhood for the simplified dataframe
    neighborhood_rows.append([
            row['RegionName'],
            row['2020-05-31'],
            location.latitude,
            location.longitude])

# Builds the simplified dataframe in one step instead of appending row by row
gainesville_neighborhoods = pd.DataFrame(neighborhood_rows, columns=gainesville_neighborhoods.columns)

In [24]:
# Check our new dataframe
print(gainesville_neighborhoods.shape)
gainesville_neighborhoods.head()

(48, 4)


| | Neighborhood | Home Value | Lat | Long |
|---|---|---|---|---|
| 0 | University Park | 306186.0 | 29.657749 | -82.347902 |
| 1 | Ashton | 338884.0 | 29.647844 | -82.335316 |
| 2 | Capri | 239360.0 | 29.695226 | -82.373666 |
| 3 | Gateway Park | 148293.0 | 29.667723 | -82.333659 |
| 4 | Highland Court Manor | 143722.0 | 29.678898 | -82.315343 |


In [23]:
# Create map of Gainesville to see our neighborhoods.
neighborhood_map = folium.Map(location=[GAINESVILLE_LATITUDE, GAINESVILLE_LONGITUDE], zoom_start=12)

# Add a marker to the map for each neighborhood.
for index, row in gainesville_neighborhoods.iterrows():
    folium.CircleMarker(
        [row['Lat'], row['Long']],
        radius=5,
        color='blue',
        fill=True,
        fill_color='#E38B02',
        fill_opacity=0.7,
        parse_html=False).add_to(neighborhood_map)  
    
neighborhood_map

### Add median home value to each potential location.

First, we default all locations to the Gainesville median value. Then, if the closest neighborhood is less than 3 km from a location, we use that neighborhood's home value instead.

In [29]:
gainesville_locations['Neighborhood'] = 'None'
gainesville_locations['Home Value'] = GAINESVILLE_MEDIAN_HOME_VALUE

# Cycle over each potential location
for index, row in gainesville_locations.iterrows():
    
    # Calculate distances to each neighborhood
    distances = []
    for neighborhood_index, neighborhood_row in gainesville_neighborhoods.iterrows():
        distance = calculate_distance(row['Lat'], row['Long'], neighborhood_row['Lat'], neighborhood_row['Long'])
        distances.append((neighborhood_row['Neighborhood'], distance, neighborhood_row['Home Value']))
                         
    # Identify closest neighborhood
    closest = sorted(distances, key = lambda x: x[1])[0]
    
    # If the neighborhood is less than 3km from location, save it instead of Gainesville median value
    if closest[1] < 3000:
        gainesville_locations.loc[index, 'Neighborhood'] = closest[0]
        gainesville_locations.loc[index, 'Home Value'] = closest[2]

In [30]:
# Check out our updated dataframe
gainesville_locations.head(20)

Locations with no neighborhood: 48


| | Location | Lat | Long | Neighborhood | Home Value |
|---|---|---|---|---|---|
| 0 | 0-0 | 29.70617 | -82.447966 | | 207069.0 |
| 1 | 0-1 | 29.70617 | -82.435976 | | 207069.0 |
| 2 | 0-2 | 29.70617 | -82.423986 | Kensington Park | 556897.0 |
| 3 | 0-3 | 29.70617 | -82.411996 | Kensington Park | 556897.0 |
| 4 | 0-4 | 29.70617 | -82.400005 | Kensington Park | 556897.0 |
| 5 | 0-5 | 29.70617 | -82.388015 | Kensington Park | 556897.0 |
| 6 | 0-6 | 29.70617 | -82.376025 | Northwood | 199357.0 |
| 7 | 0-7 | 29.70617 | -82.364035 | Northwood | 199357.0 |
| 8 | 0-8 | 29.70617 | -82.352045 | Hazel Heights | 176408.0 |
| 9 | 0-9 | 29.70617 | -82.340055 | Hazel Heights | 176408.0 |
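
The assignment rule above (take the closest neighborhood, but only if it is under 3 km away) can be isolated into a small function. The sketch below is a dependency-free illustration; `haversine_m` and `nearest_within` are hypothetical names, the two sample neighborhoods are copied from the neighborhood table earlier, and the query points are made up. It assumes the neighborhood list is not empty.

```python
import math


def haversine_m(lat1, long1, lat2, long2, radius_m=6372795):
    """Great-circle distance in meters between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    a = (math.sin((phi2 - phi1) / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(math.radians(long2 - long1) / 2) ** 2)
    return 2 * radius_m * math.asin(math.sqrt(a))


def nearest_within(point, neighborhoods, max_distance_m):
    """Return (name, value) of the closest neighborhood, or None if it is too far away."""
    name, value, distance = min(
        ((n, v, haversine_m(point[0], point[1], lat, long)) for n, v, lat, long in neighborhoods),
        key=lambda item: item[2],
    )
    return (name, value) if distance < max_distance_m else None


# (name, home value, lat, long) taken from the neighborhood table above.
sample_neighborhoods = [
    ('University Park', 306186.0, 29.657749, -82.347902),
    ('Capri', 239360.0, 29.695226, -82.373666),
]
near = nearest_within((29.695, -82.370), sample_neighborhoods, 3000)  # made-up point near Capri
far = nearest_within((29.60, -82.45), sample_neighborhoods, 3000)     # made-up point far from both
print(near, far)
```

A location for which the function returns `None` would keep the citywide median as its home value.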


## Part 3 - Getting nearby businesses for each location

In [31]:
# The code was removed by Watson Studio for sharing.

In [32]:
# Prepares our venue DataFrame.
location_venues = pd.DataFrame(columns=[
                            'Location',
                            'Lat',
                            'Long', 
                            'Venue', 
                            'Venue Latitude', 
                            'Venue Longitude', 
                            'Venue Category'])

In [33]:
# Function for getting all venues in an area
def get_venues(lat, long, limit):
    """Returns the venues within LOCATION_RADIUS meters of the given point."""
    # Create the API request URL.
    url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
        CLIENT_ID, 
        CLIENT_SECRET, 
        VERSION,
        lat, 
        long, 
        LOCATION_RADIUS, 
        limit)
    
    # Load our results.
    r = requests.get(url)
    results = r.json()
    
    # Get the venues.
    try:
        return results['response']['groups'][0]['items']
    except (KeyError, IndexError):
        # Report the coordinates passed in rather than relying on a global `row`.
        print('Trouble finding venues for ({}, {}). Returned response was:'.format(lat, long), results.get('response'))
        return []
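
The request above is assembled with string formatting and parsed with nested indexing. The sketch below shows one alternative using only the standard library: `urlencode` escapes the query parameters, and a small helper returns an empty list when any level of the response is missing. `build_explore_url` and `extract_items` are hypothetical helper names, and the credentials and version string are placeholders, since the real values are hidden in the notebook.

```python
from urllib.parse import urlencode

EXPLORE_ENDPOINT = 'https://api.foursquare.com/v2/venues/explore'


def build_explore_url(lat, long, radius, limit, client_id, client_secret, version):
    """Assemble the explore URL, letting urlencode escape each parameter value."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, long),
        'radius': radius,
        'limit': limit,
    }
    # safe=',' keeps the comma in the 'll' value unescaped.
    return '{}?{}'.format(EXPLORE_ENDPOINT, urlencode(params, safe=','))


def extract_items(results):
    """Return the venue items from a parsed response, or [] if any level is missing."""
    groups = results.get('response', {}).get('groups', [])
    if not groups:
        return []
    return groups[0].get('items', [])


# Placeholder credentials and version, for illustration only.
url = build_explore_url(29.70617, -82.447966, 579, 100, 'ID', 'SECRET', '20200601')
print(url)
print(extract_items({'response': {'groups': [{'items': [{'venue': {'name': 'Example'}}]}]}}))
print(extract_items({'meta': {'code': 400}}))
```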
        

In [34]:
venue_rows = []
for index, row in gainesville_locations.iterrows():
    if index % 10 == 0:
        print('Location {} of {}...'.format(index, LOCATION_COUNT))
    venues = get_venues(row['Lat'], row['Long'], 100)
    # Collect each venue as a row for our DataFrame.
    for venue in venues:
        venue_rows.append([
            row['Location'],
            row['Lat'],
            row['Long'],
            venue['venue']['name'],
            venue['venue']['location']['lat'], 
            venue['venue']['location']['lng'],  
            venue['venue']['categories'][0]['name']])

# Build the DataFrame in one step instead of appending row by row.
location_venues = pd.DataFrame(venue_rows, columns=location_venues.columns)

Location 0 of 176...
Location 10 of 176...
Location 20 of 176...
Location 30 of 176...
Location 40 of 176...
Location 50 of 176...
Location 60 of 176...
Location 70 of 176...
Location 80 of 176...
Location 90 of 176...
Location 100 of 176...
Location 110 of 176...
Location 120 of 176...
Location 130 of 176...
Location 140 of 176...
Location 150 of 176...
Location 160 of 176...
Location 170 of 176...


In [35]:
# Quick preview of our venues.
print(location_venues.shape)
location_venues.head()

(962, 7)


| | Location | Lat | Long | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 0 | 0-0 | 29.70617 | -82.447966 | The Hammock Lake | 29.707468 | -82.44342 | Lake |
| 1 | 0-2 | 29.70617 | -82.423986 | Flying Ten Airport-OJ8 | 29.702759 | -82.425678 | Airport Terminal |
| 2 | 0-3 | 29.70617 | -82.411996 | Calusa Animal Inn | 29.705234 | -82.410398 | Pet Store |
| 3 | 0-3 | 29.70617 | -82.411996 | McGriff Landscaping and Fencing | 29.708351 | -82.417159 | Construction & Landscaping |
| 4 | 0-4 | 29.70617 | -82.400005 | Devil's Millhopper Geological State Park | 29.70541 | -82.39441 | State / Provincial Park |


## Part 4 - Get just nearby restaurants for each location

In [36]:
# Sets up our categories
food_category = '4d4b7105d754a06374d81259'

In [37]:
# Prepares our restaurant DataFrame.
location_restaurants = pd.DataFrame(columns=[
                            'Location',
                            'Lat',
                            'Long',
                            'Venue', 
                            'Venue Latitude', 
                            'Venue Longitude', 
                            'Venue Category'])

In [38]:
# Function for getting all food venues in an area
def get_restaurants(lat, long, limit):
    """Returns the food-category venues within LOCATION_RADIUS meters of the given point."""
    # Create the API request URL.
    url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&categoryId={}&ll={},{}&radius={}&limit={}'.format(
        CLIENT_ID, 
        CLIENT_SECRET, 
        VERSION,
        food_category,
        lat, 
        long, 
        LOCATION_RADIUS, 
        limit)
    
    # Load our results.
    r = requests.get(url)
    results = r.json()
    
    # Get the venues.
    try:
        return results['response']['groups'][0]['items']
    except (KeyError, IndexError):
        # Report the coordinates passed in rather than relying on a global `row`.
        print('Trouble finding venues for ({}, {}). Returned response was:'.format(lat, long), results.get('response'))
        return []
        

In [39]:
restaurant_rows = []
for index, row in gainesville_locations.iterrows():
    if index % 10 == 0:
        print('Location {} of {}...'.format(index, LOCATION_COUNT))
    venues = get_restaurants(row['Lat'], row['Long'], 100)
    # Collect each restaurant as a row for our DataFrame.
    for venue in venues:
        restaurant_rows.append([
            row['Location'],
            row['Lat'],
            row['Long'],
            venue['venue']['name'],
            venue['venue']['location']['lat'], 
            venue['venue']['location']['lng'],  
            venue['venue']['categories'][0]['name']])

# Build the DataFrame in one step instead of appending row by row.
location_restaurants = pd.DataFrame(restaurant_rows, columns=location_restaurants.columns)

Location 0 of 176...
Location 10 of 176...
Location 20 of 176...
Location 30 of 176...
Location 40 of 176...
Location 50 of 176...
Location 60 of 176...
Location 70 of 176...
Location 80 of 176...
Location 90 of 176...
Location 100 of 176...
Location 110 of 176...
Location 120 of 176...
Location 130 of 176...
Location 140 of 176...
Location 150 of 176...
Location 160 of 176...
Location 170 of 176...


In [40]:
# Quick preview of our restaurants.
print(location_restaurants.shape)
location_restaurants.head()

(442, 7)


| | Location | Lat | Long | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 0 | 0-5 | 29.70617 | -82.388015 | Piesanos Stone Fired Pizza | 29.701873 | -82.390284 | Pizza Place |
| 1 | 0-5 | 29.70617 | -82.388015 | China Bowl | 29.702655 | -82.390303 | Chinese Restaurant |
| 2 | 0-5 | 29.70617 | -82.388015 | Cedar River Seafood | 29.701723 | -82.387995 | Seafood Restaurant |
| 3 | 0-5 | 29.70617 | -82.388015 | SUBWAY | 29.702775 | -82.390566 | Sandwich Place |
| 4 | 0-5 | 29.70617 | -82.388015 | Little Caesars Pizza | 29.701369 | -82.387991 | Pizza Place |


In [41]:
# Let's see how many locations have at least one restaurant.
print('Total locations with at least one restaurant: {}'.format(len(location_restaurants.groupby('Location').count())))

Total locations with at least one restaurant: 80


## Part 5 - Determining which locations are best for new bakery

### First, cluster our locations using k-means

In [42]:
# Get our dummified categories.
venue_dummified = pd.get_dummies(location_venues[['Venue Category']], prefix="", prefix_sep="")

# Add our location back to dataframe.
venue_dummified['Location'] = location_venues['Location'] 

# Move location column to the beginning.
# Thanks to https://stackoverflow.com/a/56479671 😅
venue_dummified = venue_dummified[ ['Location'] + [ col for col in venue_dummified.columns if col != 'Location' ] ]

In [43]:
# Review our dataframe.
print('Shape:', venue_dummified.shape)
venue_dummified.head()

Shape: (962, 229)


Unnamed: 0,Location,Accessories Store,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Arcade,Art Gallery,...,Vegetarian / Vegan Restaurant,Veterinarian,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Water Park,Wings Joint,Women's Store,Yoga Studio
0,0-0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0-2,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0-3,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0-3,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0-4,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [62]:
# Calculate our mean venue categories per location.
venue_groups = venue_dummified.groupby('Location').mean().reset_index()
print('Shape:',venue_groups.shape)
venue_groups.head()

Shape: (141, 229)


Unnamed: 0,Location,Accessories Store,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Arcade,Art Gallery,...,Vegetarian / Vegan Restaurant,Veterinarian,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Water Park,Wings Joint,Women's Store,Yoga Studio
0,0-0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0-10,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0-13,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0-14,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0-15,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
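
Taking the mean of one-hot columns per location gives, for each category, the share of that location's venues that fall into the category. The sketch below reproduces that idea without pandas. The first three rows come from the venue preview earlier in the notebook, and the last row is made up so that one location has a repeated category.

```python
from collections import Counter, defaultdict

# (location, category) pairs; the final row is hypothetical.
venue_rows = [
    ('0-0', 'Lake'),
    ('0-3', 'Pet Store'),
    ('0-3', 'Construction & Landscaping'),
    ('0-3', 'Pet Store'),
]

# Count how often each category occurs within each location.
counts = defaultdict(Counter)
for location, category in venue_rows:
    counts[location][category] += 1

# Convert counts to shares so that locations with many venues are comparable
# with locations that have only a few.
shares = {
    location: {category: n / sum(counter.values()) for category, n in counter.items()}
    for location, counter in counts.items()
}
print(shares['0-3'])
```

Because the values are shares, a location with a single venue puts all of its weight on one category, which is worth keeping in mind when reading the clusters.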


In [63]:
num_top_venues = 10

# Create columns according to number of top venues.
indicators = ['st', 'nd', 'rd']
columns = ['Location']
for ind in np.arange(num_top_venues):
    if ind < len(indicators):
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    else:
        columns.append('{}th Most Common Venue'.format(ind+1))

# Create a new empty dataframe with our new columns and add in our locations.
location_venues_sorted = pd.DataFrame(columns=columns)
location_venues_sorted['Location'] = venue_groups['Location']

# Cycle over location groups...
for index, row in venue_groups.iterrows():
    # And add in num_top_venues of the top venue categories to each location.
    location_venues_sorted.iloc[index, 1:] = row.iloc[1:].sort_values(ascending=False).index.values[0:num_top_venues]

location_venues_sorted.head()

Unnamed: 0,Location,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,0-0,Lake,Yoga Studio,Financial or Legal Service,Garden,Furniture / Home Store,Frozen Yogurt Shop,Fried Chicken Joint,French Restaurant,Food Truck,Food Service
1,0-10,Electronics Store,Yoga Studio,Fish & Chips Shop,Garden,Furniture / Home Store,Frozen Yogurt Shop,Fried Chicken Joint,French Restaurant,Food Truck,Food Service
2,0-13,Donut Shop,Yoga Studio,Fish & Chips Shop,Garden,Furniture / Home Store,Frozen Yogurt Shop,Fried Chicken Joint,French Restaurant,Food Truck,Food Service
3,0-14,Pet Store,Convenience Store,Yoga Studio,Furniture / Home Store,Frozen Yogurt Shop,Fried Chicken Joint,French Restaurant,Food Truck,Food Service,Food Court
4,0-15,Baseball Field,Yoga Studio,Fish Market,Garden Center,Garden,Furniture / Home Store,Frozen Yogurt Shop,Fried Chicken Joint,French Restaurant,Food Truck
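
In the table above, a location such as 0-0 lists the same trailing categories as its neighbors, which suggests those entries are zero-frequency categories that only appear because ten columns must be filled. The sketch below shows one possible variation that ranks only the categories actually present, together with an ordinal helper that also handles 11th through 13th. Both `ordinal` and `top_categories` are hypothetical names and are not part of the notebook.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 12 -> '12th'."""
    if 10 <= n % 100 <= 20:
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)


def top_categories(shares, limit):
    """Rank categories by share, highest first, skipping categories with a zero share."""
    present = [(category, share) for category, share in shares.items() if share > 0]
    present.sort(key=lambda item: (-item[1], item[0]))  # ties are broken alphabetically
    return [category for category, _ in present[:limit]]


labels = ['{} Most Common Venue'.format(ordinal(i)) for i in range(1, 4)]
ranked = top_categories({'Lake': 1.0, 'Yoga Studio': 0.0, 'Garden': 0.0}, 3)
print(labels)
print(ranked)
```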


In [64]:
# Our number of clusters.
kclusters = 10

# Calculate our KMeans.
location_groups_clustering = venue_groups.drop(columns='Location')
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(location_groups_clustering)
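
For readers unfamiliar with what the clustering step does, the sketch below implements the basic assign-and-update loop (Lloyd's algorithm) in plain Python on four toy points. It is a conceptual illustration only: scikit-learn's `KMeans` chooses its starting centroids with k-means++ by default and is far more efficient, whereas here the starting centroids are passed in by hand.

```python
def assign(points, centroids):
    """Return the index of the nearest centroid for each point (squared Euclidean distance)."""
    labels = []
    for p in points:
        distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
        labels.append(distances.index(min(distances)))
    return labels


def update(points, labels, centroids):
    """Move each centroid to the mean of its assigned points; leave it in place if it has none."""
    new_centroids = []
    for k, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == k]
        if members:
            new_centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
        else:
            new_centroids.append(old)
    return new_centroids


def lloyd(points, centroids, max_iter=100):
    """Alternate assignment and update steps until the labels stop changing."""
    labels = assign(points, centroids)
    for _ in range(max_iter):
        centroids = update(points, labels, centroids)
        new_labels = assign(points, centroids)
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids


# Toy category-share vectors: two locations dominated by one category, two by another.
points = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
labels, centroids = lloyd(points, centroids=[(1.0, 0.0), (0.0, 1.0)])
print(labels)
print(centroids)
```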

In [65]:
# Add our clustering labels to our dataframe.
location_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

In [66]:
# Start preparing our final dataframe.
location_df_final = gainesville_locations.copy()

# Merge in our location clustering results.
location_df_final = location_df_final.join(location_venues_sorted.set_index('Location'), on='Location')

# If any location didn't have venues or ended with NaN scores, let's drop it.
location_df_final = location_df_final.dropna()

# Make sure the cluster labels are in int for our calculations.
location_df_final['Cluster Labels'] = location_df_final['Cluster Labels'].astype('int32')

location_df_final.head()

Unnamed: 0,Location,Lat,Long,Neighborhood,Home Value,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,0-0,29.70617,-82.447966,,207069.0,1,Lake,Yoga Studio,Financial or Legal Service,Garden,Furniture / Home Store,Frozen Yogurt Shop,Fried Chicken Joint,French Restaurant,Food Truck,Food Service
2,0-2,29.70617,-82.423986,Kensington Park,556897.0,1,Airport Terminal,Yoga Studio,Fish & Chips Shop,Garden Center,Garden,Furniture / Home Store,Frozen Yogurt Shop,Fried Chicken Joint,French Restaurant,Food Truck
3,0-3,29.70617,-82.411996,Kensington Park,556897.0,9,Pet Store,Construction & Landscaping,Yoga Studio,Furniture / Home Store,Frozen Yogurt Shop,Fried Chicken Joint,French Restaurant,Food Truck,Food Service,Food Court
4,0-4,29.70617,-82.400005,Kensington Park,556897.0,1,State / Provincial Park,Construction & Landscaping,Tennis Court,Yoga Studio,Fish & Chips Shop,Furniture / Home Store,Frozen Yogurt Shop,Fried Chicken Joint,French Restaurant,Food Truck
5,0-5,29.70617,-82.388015,Kensington Park,556897.0,1,Pizza Place,Pharmacy,Music Store,Salon / Barbershop,Food,Chinese Restaurant,Grocery Store,Seafood Restaurant,Bank,Sandwich Place


In [67]:
# Create our map.
map_clusters = folium.Map(location=[GAINESVILLE_LATITUDE, GAINESVILLE_LONGITUDE], zoom_start=10)

# Set up a different color for each cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# Add each location as a marker on the map.
for lat, lon, poi, cluster in zip(location_df_final['Lat'], location_df_final['Long'], location_df_final['Location'], location_df_final['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

### Next, determine which clusters contain the most bakeries

We'll use this to determine which type of location works best for a bakery.

In [68]:
# Now, get the locations with a bakery in it.
bakery_locations = location_venues[location_venues['Venue Category'] == 'Bakery']
print('Total locations with bakeries: {}'.format(len(bakery_locations.groupby('Location').count())))
bakery_locations.head()

Total locations with bakeries: 6


| | Location | Lat | Long | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 36 | 0-8 | 29.70617 | -82.352045 | Walmart Bakery | 29.70644 | -82.35684 | Bakery |
| 169 | 3-5 | 29.674903 | -82.388015 | Uppercrust | 29.674301 | -82.387022 | Bakery |
| 425 | 5-3 | 29.654059 | -82.411996 | Cinnabon | 29.656574 | -82.411222 | Bakery |
| 695 | 7-6 | 29.633215 | -82.376025 | Midnight Cookies | 29.633088 | -82.373505 | Bakery |
| 789 | 8-6 | 29.622792 | -82.376025 | Panera Bread | 29.625161 | -82.373782 | Bakery |


In [69]:
# Determine which cluster has most bakeries in it.
bakery_locations.merge(location_df_final)['Cluster Labels'].value_counts()

1    6
Name: Cluster Labels, dtype: int64

### Next, determine which locations in that cluster are ideal locations

In [70]:
# Get all locations with at least one restaurant.
locations_with_restaurants = location_df_final[location_df_final['Location'].isin(location_restaurants['Location'])]
print(locations_with_restaurants.shape)
locations_with_restaurants.head()

(78, 16)


Unnamed: 0,Location,Lat,Long,Neighborhood,Home Value,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
5,0-5,29.70617,-82.388015,Kensington Park,556897.0,1,Pizza Place,Pharmacy,Music Store,Salon / Barbershop,Food,Chinese Restaurant,Grocery Store,Seafood Restaurant,Bank,Sandwich Place
7,0-7,29.70617,-82.364035,Northwood,199357.0,1,Breakfast Spot,Yoga Studio,Garden Center,Garden,Furniture / Home Store,Frozen Yogurt Shop,Fried Chicken Joint,French Restaurant,Food Truck,Food Service
8,0-8,29.70617,-82.352045,Hazel Heights,176408.0,1,Sandwich Place,Convenience Store,Bakery,Park,Farmers Market,Liquor Store,Construction & Landscaping,Big Box Store,Video Game Store,Video Store
9,0-9,29.70617,-82.340055,Hazel Heights,176408.0,1,Breakfast Spot,Intersection,Business Service,Concert Hall,Sandwich Place,Fish & Chips Shop,Furniture / Home Store,Frozen Yogurt Shop,Fried Chicken Joint,French Restaurant
13,0-13,29.70617,-82.292094,,207069.0,1,Donut Shop,Yoga Studio,Fish & Chips Shop,Garden,Furniture / Home Store,Frozen Yogurt Shop,Fried Chicken Joint,French Restaurant,Food Truck,Food Service


In [71]:
# Get all locations with at least one restaurant within cluster 1 (the one with most bakeries).
cluster_one_restaurant_locations = locations_with_restaurants[locations_with_restaurants['Cluster Labels'] == 1]
print(cluster_one_restaurant_locations.shape)
cluster_one_restaurant_locations.head()

(70, 16)


Unnamed: 0,Location,Lat,Long,Neighborhood,Home Value,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
5,0-5,29.70617,-82.388015,Kensington Park,556897.0,1,Pizza Place,Pharmacy,Music Store,Salon / Barbershop,Food,Chinese Restaurant,Grocery Store,Seafood Restaurant,Bank,Sandwich Place
7,0-7,29.70617,-82.364035,Northwood,199357.0,1,Breakfast Spot,Yoga Studio,Garden Center,Garden,Furniture / Home Store,Frozen Yogurt Shop,Fried Chicken Joint,French Restaurant,Food Truck,Food Service
8,0-8,29.70617,-82.352045,Hazel Heights,176408.0,1,Sandwich Place,Convenience Store,Bakery,Park,Farmers Market,Liquor Store,Construction & Landscaping,Big Box Store,Video Game Store,Video Store
9,0-9,29.70617,-82.340055,Hazel Heights,176408.0,1,Breakfast Spot,Intersection,Business Service,Concert Hall,Sandwich Place,Fish & Chips Shop,Furniture / Home Store,Frozen Yogurt Shop,Fried Chicken Joint,French Restaurant
13,0-13,29.70617,-82.292094,,207069.0,1,Donut Shop,Yoga Studio,Fish & Chips Shop,Garden,Furniture / Home Store,Frozen Yogurt Shop,Fried Chicken Joint,French Restaurant,Food Truck,Food Service
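
One way to gauge how informative the bakery count is: cluster 1 holds 70 of the 78 locations with restaurants, so most randomly chosen locations would land in it anyway. The sketch below estimates how likely it is that six locations picked at random all fall in cluster 1. It assumes the six bakery locations are among the 78 restaurant locations, which the notebook does not confirm, and it is a rough base-rate check rather than part of the original analysis.

```python
restaurant_locations = 78  # locations with at least one restaurant (shape printed above)
in_cluster_one = 70        # of those, the number assigned to cluster 1
bakery_locations = 6       # locations that contain a bakery

# Probability that 6 locations drawn without replacement from the 78
# all belong to the 70 that are in cluster 1.
p_all_in_cluster_one = 1.0
for i in range(bakery_locations):
    p_all_in_cluster_one *= (in_cluster_one - i) / (restaurant_locations - i)

print(round(p_all_in_cluster_one, 2))
```

If that probability is high, the bakery count mainly reflects the size of cluster 1, and the later filters (restaurants nearby, no existing bakery, home value) carry more of the selection.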


In [72]:
# Get all locations with at least one restaurant within cluster 1 that do not have a bakery.
potential_locations = cluster_one_restaurant_locations[~cluster_one_restaurant_locations['Location'].isin(bakery_locations['Location'])]
print(potential_locations.shape)
potential_locations.head()

(64, 16)


Unnamed: 0,Location,Lat,Long,Neighborhood,Home Value,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
5,0-5,29.70617,-82.388015,Kensington Park,556897.0,1,Pizza Place,Pharmacy,Music Store,Salon / Barbershop,Food,Chinese Restaurant,Grocery Store,Seafood Restaurant,Bank,Sandwich Place
7,0-7,29.70617,-82.364035,Northwood,199357.0,1,Breakfast Spot,Yoga Studio,Garden Center,Garden,Furniture / Home Store,Frozen Yogurt Shop,Fried Chicken Joint,French Restaurant,Food Truck,Food Service
9,0-9,29.70617,-82.340055,Hazel Heights,176408.0,1,Breakfast Spot,Intersection,Business Service,Concert Hall,Sandwich Place,Fish & Chips Shop,Furniture / Home Store,Frozen Yogurt Shop,Fried Chicken Joint,French Restaurant
13,0-13,29.70617,-82.292094,,207069.0,1,Donut Shop,Yoga Studio,Fish & Chips Shop,Garden,Furniture / Home Store,Frozen Yogurt Shop,Fried Chicken Joint,French Restaurant,Food Truck,Food Service
25,1-9,29.695748,-82.340055,Hazel Heights,176408.0,1,Furniture / Home Store,Boutique,Coffee Shop,Business Service,Park,Automotive Shop,Yoga Studio,Frozen Yogurt Shop,Fried Chicken Joint,French Restaurant


In [79]:
# Finally, keep only potential locations whose home value is above the median home value of Gainesville
potential_locations = potential_locations[potential_locations['Home Value'] > GAINESVILLE_MEDIAN_HOME_VALUE]
print(potential_locations.shape)
potential_locations.head()

(16, 16)


Unnamed: 0,Location,Lat,Long,Neighborhood,Home Value,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
5,0-5,29.70617,-82.388015,Kensington Park,556897.0,1,Pizza Place,Pharmacy,Music Store,Salon / Barbershop,Food,Chinese Restaurant,Grocery Store,Seafood Restaurant,Bank,Sandwich Place
36,2-4,29.685326,-82.400005,Kensington Park,556897.0,1,BBQ Joint,Insurance Office,Furniture / Home Store,Seafood Restaurant,Automotive Shop,Spa,Donut Shop,Dive Bar,Garden,Discount Store
39,2-7,29.685326,-82.364035,Royal Gardens,296000.0,1,Coffee Shop,Beer Garden,Pizza Place,Gym / Fitness Center,Yoga Studio,Fish & Chips Shop,Furniture / Home Store,Frozen Yogurt Shop,Fried Chicken Joint,French Restaurant
50,3-2,29.674903,-82.423986,West Hills,377014.0,1,Diner,American Restaurant,Fish & Chips Shop,Garden Center,Garden,Furniture / Home Store,Frozen Yogurt Shop,Fried Chicken Joint,French Restaurant,Food Truck
54,3-6,29.674903,-82.376025,Edgewood Hills,286751.0,1,Grocery Store,Martial Arts Dojo,Yoga Studio,Financial or Legal Service,Furniture / Home Store,Frozen Yogurt Shop,Fried Chicken Joint,French Restaurant,Food Truck,Food Service
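
The four filters in this section (target cluster, at least one restaurant, no bakery, home value above the median) can be read as a single predicate. The sketch below expresses them that way on plain dictionaries. `is_candidate` is a hypothetical name, the values for 0-5 and 0-8 follow the tables above, and the 9-9 record is made up.

```python
GAINESVILLE_MEDIAN_HOME_VALUE = 207069


def is_candidate(record, target_cluster, restaurant_locations, bakery_locations):
    """Apply the four filters from this section to one location record."""
    return (
        record['cluster'] == target_cluster
        and record['location'] in restaurant_locations
        and record['location'] not in bakery_locations
        and record['home_value'] > GAINESVILLE_MEDIAN_HOME_VALUE
    )


# Toy records: 0-5 and 0-8 follow the tables above, 9-9 is hypothetical.
records = [
    {'location': '0-5', 'cluster': 1, 'home_value': 556897.0},
    {'location': '0-8', 'cluster': 1, 'home_value': 176408.0},
    {'location': '9-9', 'cluster': 3, 'home_value': 400000.0},
]
restaurant_locations = {'0-5', '0-8', '9-9'}
bakery_locations = {'0-8'}

candidates = [r['location'] for r in records
              if is_candidate(r, 1, restaurant_locations, bakery_locations)]
print(candidates)
```

Because the comparison is strictly greater than the median, any location that kept the default median value (no neighborhood within 3 km) is excluded by the last filter.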


In [80]:
# Create map of Gainesville to see our potential locations.
general_map = folium.Map(location=[GAINESVILLE_LATITUDE, GAINESVILLE_LONGITUDE], zoom_start=12)

# Add a marker to the map for each potential location.
for index, row in potential_locations.iterrows():
    folium.CircleMarker(
        [row['Lat'], row['Long']],
        radius=5,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(general_map)  
    
general_map

### Lastly, determine addresses of the potential locations

In [84]:
potential_addresses = []
for index, row in potential_locations.iterrows():
    location = geolocator.reverse((row['Lat'], row['Long']), exactly_one=True)
    potential_addresses.append(location.address)
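
A reverse-geocoding call can return no result for some coordinates, in which case reading `.address` would fail. The sketch below shows one way to tolerate that. It takes the lookup as a parameter so it can be demonstrated with a stand-in; `collect_addresses`, `FakeLocation`, and `fake_reverse` are hypothetical names used only for this illustration.

```python
def collect_addresses(points, reverse):
    """Reverse-geocode each (lat, long) point, tolerating missing results.

    `reverse` is any callable that takes a (lat, long) tuple and returns an
    object with an `.address` attribute, or None when nothing is found.
    """
    addresses = []
    for point in points:
        location = reverse(point)
        addresses.append(location.address if location is not None else None)
    return addresses


class FakeLocation:
    """Stand-in for a geocoder result, used only for this illustration."""

    def __init__(self, address):
        self.address = address


def fake_reverse(point):
    # Pretend that only the first point can be resolved.
    known = {(29.70617, -82.388015): FakeLocation('5651 NW 43rd St, Gainesville, FL 32653, USA')}
    return known.get(point)


print(collect_addresses([(29.70617, -82.388015), (0.0, 0.0)], fake_reverse))
```

With the notebook's geocoder, the second argument would presumably be `geolocator.reverse`.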

In [86]:
for address in potential_addresses:
    print(address)

5651 NW 43rd St, Gainesville, FL 32653, USA
3506 NW 53rd Terrace, Gainesville, FL 32606, USA
3525 NW 28th Terrace, Gainesville, FL 32605, USA
2393 NW 77th Blvd, Gainesville, FL 32606, USA
2416 NW 35th Terrace, Gainesville, FL 32605, USA
1240 NW 76th Blvd, Gainesville, FL 32606, USA
1173 NW 64th Terrace, Gainesville, FL 32605, USA
1334 NE 8th St, Gainesville, FL 32601, USA
7540 W University Ave, Gainesville, FL 32607, USA
4343 W Newberry Rd, Gainesville, FL 32607, USA
325 NW 14th St, Gainesville, FL 32603, USA
804 NE 3rd Ave, Gainesville, FL 32601, USA
1322 Diamond Rd, Gainesville, FL 32601, USA
I-75, Gainesville, FL 32607, USA
2000 SW 13th St, Gainesville, FL 32608, USA
1905 FL-329, Gainesville, FL 32601, USA
