# Using Data Science for Strategic Location Identification: Opening a Cannabis Dispensary in Denver

## Table of Contents*
1. [Overview](#Overview)
2. [The Problem](#The-Problem)
3. [Business Understanding](#Business-Understanding)
4. [Analytic Approach](#Analytic-Approach)
5. [Data Requirements](#Data-Requirements)
6. [Data Collection](#Data-Collection)
7. [Data Preparation](#Data-Preparation)
8. [Data Understanding](#Data-Understanding)
9. [Modeling](#Modeling)
10. [Results](#Results)
11. [Discussion](#Discussion)
12. [References](#References)

*Anchor links may not function properly when viewing Jupyter Notebooks on GitHub.

## Overview
A proprietor would like to open a new marijuana dispensary in Denver, Colorado. She is seeking advice on identifying strategic locations. She would like competition and market saturation to be at a minimum.

## The Problem
Denver, officially the City and County of Denver, is one of the fastest-growing cities in the United States and is consistently rated as one of the best US cities in which to locate a new business. One of the unique aspects of Denver, and Colorado, is the legal status of cannabis. Retail sale of cannabis can be a profitable business, but with approximately 350 medical centers and recreational cannabis retail establishments in Denver, most of the city already has convenient access to such facilities. If a proprietor wants to open a cannabis retail establishment in Denver, where should the business be located? Data science methodology can be used to help find a solution to this problem.

## Business Understanding
"Every project, regardless of its size, starts with business understanding, which lays the foundation for successful resolution of the business problem. The business sponsors needing the analytic solution play the critical role in this stage by defining the problem, project objectives and solution requirements from a business perspective. And, believe it or not—even with nine stages still to go—this first stage is the hardest."<sup>[1]</sup>
  ### Defining the Problem
The problem that we are trying to solve here is stated best as a question: Considering the large number of cannabis retailers already present in Denver, what are the best neighborhoods in Denver to open a new operation?
#### Consideration Regarding ZIP Codes as an Alternative to Neighborhoods
<p>Geocoding areas of the city by ZIP Code would seem to be a more efficient way of querying most geocoding APIs than using a string consisting of neighborhood name, city, and state, because a ZIP Code is essentially an unambiguous, unique, geocodable number that leaves less room for noise. Data gathering would be easier and the resulting data more reliable if ZIP Codes were analyzed instead of neighborhoods; however, this approach has a complication.</p>
<p>Because regulations, laws, and taxes almost always have a tangible impact on businesses, it is important to limit commingling of data with neighboring jurisdictions. Colorado laws and regulations vary substantially from county to county. Many ZIP Codes are not exclusive to one city or one county, and that is the case in Denver: many of its ZIP Codes are shared with neighboring communities and counties. Because this study focuses specifically on the City and County of Denver, ZIP Codes will not be used, so that the data is not tainted by coordinate centroids outside the city and county limits.</p>
<p>See the ZIP Code map in the Data Visualization section for further explanation.</p>
  ### Project Objectives
The objective of this project is to find the neighborhoods that offer the best opportunity for a proprietor to open a new retail cannabis operation.
  ### Solution Requirements
The problem will be solved when the neighborhoods with the most potential are determined using real data. The best neighborhoods are defined as those with the fewest cannabis stores relative to the total number of businesses.
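
The definition above amounts to one ratio per neighborhood. The following is a minimal sketch of that ratio, assuming made-up counts: the neighborhood names come from Denver's list, but the numbers and the helper name `cannabis_saturation` are hypothetical and only illustrate the calculation.

```python
# Hypothetical counts for illustration only; these are not real Denver data.
venue_counts = {
    'Athmar Park': {'cannabis_stores': 6, 'total_businesses': 60},
    'Auraria': {'cannabis_stores': 0, 'total_businesses': 40},
    'Baker': {'cannabis_stores': 9, 'total_businesses': 45},
}

def cannabis_saturation(cannabis_stores, total_businesses):
    """Return cannabis stores as a fraction of all businesses (0.0 if there are none)."""
    if total_businesses == 0:
        return 0.0
    return cannabis_stores / total_businesses

saturation = {
    hood: cannabis_saturation(counts['cannabis_stores'], counts['total_businesses'])
    for hood, counts in venue_counts.items()
}

# Lowest saturation first: under the definition above these rank as the best candidates.
ranked = sorted(saturation, key=saturation.get)
for hood in ranked:
    print('%s: %.2f' % (hood, saturation[hood]))
```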

## Analytic Approach
"After clearly stating a business problem, the data scientist can define the analytic approach to solving it. Doing so involves expressing the problem in the context of statistical and machine learning techniques so that the data scientist can identify techniques suitable for achieving the desired outcome."<sup>[1]</sup>
### Descriptive Approach
The analytic approach best suited to this problem is the descriptive approach. Each neighborhood within the City and County of Denver will be clustered based on its relative saturation of retail/medical cannabis stores, so that each cluster groups neighborhoods with a similar overall saturation.
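
One way to group neighborhoods by a single saturation value is k-means clustering. The sketch below is a plain-Python illustration of that idea and is not necessarily the algorithm used in the Modeling section; the function name `kmeans_1d`, the choice of three clusters, and the sample ratios are all assumptions.

```python
def kmeans_1d(values, k, iterations=100):
    """Cluster 1-D values with Lloyd's algorithm; returns (labels, centroids)."""
    ordered = sorted(values)
    # Deterministic start: spread the initial centroids evenly over the sorted values.
    centroids = [ordered[(len(ordered) - 1) * j // max(k - 1, 1)] for j in range(k)]
    labels = []
    for _ in range(iterations):
        # Assignment step: each value joins its nearest centroid.
        labels = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
        # Update step: move each centroid to the mean of its members.
        updated = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            updated.append(sum(members) / len(members) if members else centroids[c])
        if updated == centroids:
            break  # converged: no centroid moved
        centroids = updated
    return labels, centroids

# Hypothetical saturation ratios (cannabis stores / all businesses).
sample = [0.00, 0.01, 0.02, 0.10, 0.12, 0.30, 0.33]
labels, centroids = kmeans_1d(sample, k=3)
print(labels)
print(centroids)
```

With well-separated values like these, the three clusters correspond to low, medium, and high saturation.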

## Data Requirements
"Choice of analytic approach determines the data requirements, for the analytic methods to be used require particular data content, formats and representations, guided by domain knowledge."<sup>[1]</sup>

### Data Sources
#### City and County of Denver
##### List of Statistical Neighborhoods
Data from the City and County of Denver's website will be used to obtain a list of each statistical neighborhood within the City and County proper.  
URL: https://www.denvergov.org/media/gis/DataCatalog/statistical_neighborhoods/csv/statistical_neighborhoods.csv
##### Marijuana Active Business Licenses
Data from the City and County of Denver's website will be used to obtain a list of "Marijuana Active Business Licenses" within the City and County proper.  
URL: https://www.denvergov.org/media/gis/DataCatalog/marijuana_active_business_licenses/csv/marijuana_active_business_licenses.csv
#### ArcGIS Coordinate Data
Data from ArcGIS will be used to locate the geographic center of each neighborhood. The geographic center will be represented as coordinate points: (latitude, longitude)
#### GeocodeFarm Coordinate Data
Data from GeocodeFarm will be used to locate the geographic center of neighborhoods where ArcGIS data is incorrect. The geographic center will be represented as coordinate points: (latitude, longitude)
#### Foursquare
Data from Foursquare will be used to determine the frequency of cannabis stores relative to all operating businesses in each Denver neighborhood. The venues Foursquare returns for marijuana dispensaries appear to be incomplete, so they will be excluded from the data set; dispensary data will instead be obtained from the City and County of Denver.
Two API Endpoints will be used:  
      1. Venue Data: https://api.foursquare.com/v2/venues/explore  
      2. Categorical Hierarchy Data: https://api.foursquare.com/v2/venues/categories  
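
As a rough illustration of how the venue endpoint might be queried for one neighborhood centroid, the sketch below only builds the request URL and sends nothing. The helper name `build_explore_url`, the 500 m radius, the limit of 100, and the version date are assumptions chosen for the example; the credentials are placeholders.

```python
from urllib.parse import urlencode, urlparse, parse_qs

EXPLORE_URL = 'https://api.foursquare.com/v2/venues/explore'

def build_explore_url(client_id, client_secret, lat, lng,
                      radius_m=500, limit=100, version='20190110'):
    """Build a venues/explore request URL around one coordinate pair."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,                # API version date, YYYYMMDD
        'll': '%s,%s' % (lat, lng),  # latitude,longitude of the search center
        'radius': radius_m,          # search radius in meters (assumed value)
        'limit': limit,              # maximum venues per response (assumed value)
    }
    return EXPLORE_URL + '?' + urlencode(params)

url = build_explore_url('YOUR_CLIENT_ID', 'YOUR_CLIENT_SECRET', 39.70396, -105.01039)
print(url)
# Round-trip check: the coordinate pair survives URL encoding.
print(parse_qs(urlparse(url).query)['ll'])
```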

#### Wikipedia
The average radius of Earth is a necessary parameter when calculating the distance between two geographic coordinates. Wikipedia has this information available in the "Earth Radius" article.  
URL: https://en.wikipedia.org/wiki/Earth_radius
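
The haversine formula is a common way to turn two coordinate pairs and an Earth radius into a distance. The sketch below assumes that formula and the mean radius of 6371.0088 km; the function name `haversine_km` is hypothetical, and the notebook's own distance calculation may differ.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius in kilometers

def haversine_km(lat1, lng1, lat2, lng2, radius_km=EARTH_RADIUS_KM):
    """Great-circle distance in kilometers between two (latitude, longitude) points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = radians(lng2 - lng1)
    a = sin(d_phi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(d_lambda / 2) ** 2
    return 2 * radius_km * asin(sqrt(a))

# Distance between the Athmar Park and Auraria coordinates geocoded in this notebook
print(haversine_km(39.70396, -105.01039, 39.74577, -105.01002))
```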

## Data Collection
"The data scientist identifies and gathers data resources—structured, unstructured and semi-structured—that are relevant to the problem domain. On encountering gaps in data collection, the data scientist might need to revise the data requirements and collect more data."<sup>[1]</sup>

### Import Necessary Python Libraries

In [1]:
# import pandas and numpy libraries
import pandas as pd
import numpy as np

### Retrieve Data from the City and County of Denver

#### Download the CSV file of Statistical Neighborhoods from the City and County of Denver Website and Verify Data

In [2]:
# download data as csv
!wget -O denver_statistical_neighborhoods.csv 'https://www.denvergov.org/media/gis/DataCatalog/statistical_neighborhoods/csv/statistical_neighborhoods.csv'
print('Download complete!')

--2019-01-10 16:55:14--  https://www.denvergov.org/media/gis/DataCatalog/statistical_neighborhoods/csv/statistical_neighborhoods.csv
Resolving www.denvergov.org... 169.133.239.100
Connecting to www.denvergov.org|169.133.239.100|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2050 (2.0K) [application/octet-stream]
Saving to: ‘denver_statistical_neighborhoods.csv’


2019-01-10 16:55:15 (75.2 MB/s) - ‘denver_statistical_neighborhoods.csv’ saved [2050/2050]

Download complete!


In [5]:
# import csv data to pandas DataFrame
statistical_neighborhoods_df = pd.read_csv('denver_statistical_neighborhoods.csv', header=0, sep=',').sort_values(by=['NBHD_NAME']).reset_index()
# verify dataframe contents and shape
print('Rows: %i\nColumns: %i' % (statistical_neighborhoods_df.shape[0], statistical_neighborhoods_df.shape[1]))
print(statistical_neighborhoods_df.head())

Rows: 78
Columns: 5
   index  NBHD_ID    NBHD_NAME TYPOLOGY NOTES
0     22        1  Athmar Park     None  None
1      0        2      Auraria     None  None
2     69        3        Baker     None  None
3     13        4       Barnum     None  None
4     14        5  Barnum West     None  None


#### Download the CSV file of "Marijuana Active Business Licenses" from the City and County of Denver Website and Verify Data

In [6]:
# download data as csv
!wget -O denver_marijuana_businesses.csv 'https://www.denvergov.org/media/gis/DataCatalog/marijuana_active_business_licenses/csv/marijuana_active_business_licenses.csv'
print('Download complete!')

--2019-01-10 16:56:25--  https://www.denvergov.org/media/gis/DataCatalog/marijuana_active_business_licenses/csv/marijuana_active_business_licenses.csv
Resolving www.denvergov.org... 169.133.239.100
Connecting to www.denvergov.org|169.133.239.100|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 202988 (198K) [application/octet-stream]
Saving to: ‘denver_marijuana_businesses.csv’


2019-01-10 16:56:25 (2.86 MB/s) - ‘denver_marijuana_businesses.csv’ saved [202988/202988]

Download complete!


In [7]:
# import csv data to pandas DataFrame as dtype='object'
denver_marijuana_businesses_df = pd.read_csv('denver_marijuana_businesses.csv', header=0, sep=',').astype('object').sort_values(by=['Facility Zip Code'])
denver_marijuana_businesses_df.reset_index(drop=True, inplace=True)
# verify DataFrame contents
print('Rows: %i\nColumns: %i' % (denver_marijuana_businesses_df.shape[0], denver_marijuana_businesses_df.shape[1]))
denver_marijuana_businesses_df.head()

Rows: 1101
Columns: 13


Unnamed: 0,Business File Number,License Type,Entity Name,Trade Name,Current License Status,Expiration Date,Facility Street Number,Facility Pre-Direction,Facility Street Name,Facility Street Type,Facility Unit Number,Facility City,Facility Zip Code
0,2013-BFN-1068998,Retail Marij Opt. Prem. Cultiv,"MILE HIGH MEDICAL, LLC",MMJ AMERICA UPTOWN,License Issued - Active,9/4/2019 12:00:00 AM,1885,W,DARTMOUTH,AVE,6 & 8,DENVER,80110
1,2014-BFN-0003949,Retail Marijuana Store,"RJJ SHERIDAN, LLC",NATIVE ROOTS LITTLETON,License Issued - Active,1/21/2019 12:00:00 AM,7870,W,QUINCY,AVE,,DENVER,80123
2,2013-BFN-1068512,Medical Marijuana Center,"RJJ SHERIDAN, LLC",NATIVE ROOTS LITTLETON,License Issued - Active,10/3/2019 12:00:00 AM,7870,W,QUINCY,AVE,,DENVER,80123
3,2016-BFN-0004094,Retail Marijuana Store,"JVT ENTERPRISES, INC.",COLORADO CANNABIS CONNECTION,License Issued - Active,5/5/2019 12:00:00 AM,4550,S,KIPLING,ST,4,DENVER,80127
4,2014-BFN-1071718,Medical Marijuana Center,"JVT ENTERPRISES, INC.",COLORADO CANNABIS CONNECTION,License Issued - Active,7/8/2019 12:00:00 AM,4550,S,KIPLING,ST,4,DENVER,80127


In [8]:
# keep only rows where 'License Type' is 'Retail Marijuana Store' or 'Medical Marijuana Center'
license_types = ['Retail Marijuana Store', 'Medical Marijuana Center']
denver_marijuana_businesses_df = denver_marijuana_businesses_df[denver_marijuana_businesses_df['License Type'].isin(license_types)]
# replace NaN values with a blank string
denver_marijuana_businesses_df = denver_marijuana_businesses_df.replace(np.nan, '', regex=True)
# reset index
denver_marijuana_businesses_df.reset_index(drop=True, inplace=True)
# set datatype as object
denver_marijuana_businesses_df['Facility Street Number'] = denver_marijuana_businesses_df['Facility Street Number'].astype('object')
# return dataframe
denver_marijuana_businesses_df

Unnamed: 0,Business File Number,License Type,Entity Name,Trade Name,Current License Status,Expiration Date,Facility Street Number,Facility Pre-Direction,Facility Street Name,Facility Street Type,Facility Unit Number,Facility City,Facility Zip Code
0,2014-BFN-0003949,Retail Marijuana Store,"RJJ SHERIDAN, LLC",NATIVE ROOTS LITTLETON,License Issued - Active,1/21/2019 12:00:00 AM,7870,W,QUINCY,AVE,,DENVER,80123
1,2013-BFN-1068512,Medical Marijuana Center,"RJJ SHERIDAN, LLC",NATIVE ROOTS LITTLETON,License Issued - Active,10/3/2019 12:00:00 AM,7870,W,QUINCY,AVE,,DENVER,80123
2,2016-BFN-0004094,Retail Marijuana Store,"JVT ENTERPRISES, INC.",COLORADO CANNABIS CONNECTION,License Issued - Active,5/5/2019 12:00:00 AM,4550,S,KIPLING,ST,4,DENVER,80127
3,2014-BFN-1071718,Medical Marijuana Center,"JVT ENTERPRISES, INC.",COLORADO CANNABIS CONNECTION,License Issued - Active,7/8/2019 12:00:00 AM,4550,S,KIPLING,ST,4,DENVER,80127
4,2010-BFN-1045692,Medical Marijuana Center,"ALTERNATIVE MEDICINE ON THE MALL, LLC",NATIVE ROOTS APOTHECARY,License Issued - Active,3/12/2019 12:00:00 AM,910,,16TH,ST,805,DENVER,80202
5,2010-BFN-1045890,Medical Marijuana Center,LOTUS MEDICAL LLC,,License Issued - Active,12/2/2019 12:00:00 AM,1444,,WAZEE,ST,115,DENVER,80202
6,2013-BFN-1069174,Retail Marijuana Store,"DJR COLORADO, LLC",,License Issued - Active,2/10/2019 12:00:00 AM,1620,,Market,,5W,Denver,80202
7,2010-BFN-1045677,Medical Marijuana Center,1617 WAZEE STREET LLC,LODO WELLNESS CENTER,License Issued - Active,5/22/2019 12:00:00 AM,1617,,Wazee,,B,Denver,80202
8,2010-BFN-1045926,Medical Marijuana Center,"COMPASSIONATE CARE GIVERS, INC.",ROCKY MOUNTAIN HIGH,License Issued - Active,9/11/2019 12:00:00 AM,1538,,WAZEE,ST,100,DENVER,80202
9,2013-BFN-1069426,Retail Marijuana Store,"COMPASSIONATE CARE GIVERS, INC",ROCKY MOUNTAIN HIGH,License Issued - Active,3/24/2019 12:00:00 AM,1538,,WAZEE,ST,100,DENVER,80202


##### Save Data as CSV File

In [9]:
denver_marijuana_businesses_df.to_csv('denver_marijuana_businesses.csv', index=False, mode='w')
print('Export complete.')

Export complete.


### Obtain ArcGIS Coordinates

#### ArcGIS Coordinates for Neighborhood

In [10]:
# create blank dataframe for neighborhood data
hood_coordinates_df = pd.DataFrame(columns=['Neighborhood', 'ArcGIS_name', 'ArcGIS_Addr_Type', 'Latitude', 'Longitude'])
hood_coordinates_df

Unnamed: 0,Neighborhood,ArcGIS_name,ArcGIS_Addr_Type,Latitude,Longitude


In [11]:
# add neighborhoods to dataframe
hood_coordinates_df['Neighborhood'] = statistical_neighborhoods_df['NBHD_NAME']

In [12]:
# verify contents of DataFrame and shape
print('Rows: %i\nColumns: %i' % (hood_coordinates_df.shape[0], hood_coordinates_df.shape[1]))
print(hood_coordinates_df.head())

Rows: 78
Columns: 5
  Neighborhood ArcGIS_name ArcGIS_Addr_Type Latitude Longitude
0  Athmar Park         NaN              NaN      NaN       NaN
1      Auraria         NaN              NaN      NaN       NaN
2        Baker         NaN              NaN      NaN       NaN
3       Barnum         NaN              NaN      NaN       NaN
4  Barnum West         NaN              NaN      NaN       NaN


In [13]:
# expand neighborhood acronyms (.loc avoids pandas chained assignment)
hood_coordinates_df.loc[hood_coordinates_df['Neighborhood'] == 'DIA', 'Neighborhood'] = 'Denver International Airport'
print(hood_coordinates_df[hood_coordinates_df['Neighborhood'] == 'Denver International Airport'])
hood_coordinates_df.loc[hood_coordinates_df['Neighborhood'] == 'CBD', 'Neighborhood'] = 'Central Business District'
print(hood_coordinates_df[hood_coordinates_df['Neighborhood'] == 'Central Business District'])

                    Neighborhood ArcGIS_name ArcGIS_Addr_Type Latitude  \
22  Denver International Airport         NaN              NaN      NaN   

   Longitude  
22       NaN  
                Neighborhood ArcGIS_name ArcGIS_Addr_Type Latitude Longitude
8  Central Business District         NaN              NaN      NaN       NaN


In [14]:
# import geocoder library
import geocoder
# import sleep from time library
from time import sleep

In [15]:
# function for coordinate data
def get_coordinates(data_frame, column_number, city, state):
    denominator = data_frame.shape[0]
    for i, place in enumerate(data_frame.iloc[:, column_number]):
        # append the ZIP code to the query when the DataFrame has one
        if 'Zip Code' in data_frame:
            zip_code = data_frame['Zip Code'][i]
        else:
            zip_code = ''
        # print progress
        print('Getting coordinates for %s: %i of %s' % (place, i + 1, denominator))
        # initialize result variables to None
        name = None
        addr_type = None
        lat_lng_coords = None

        # loop until the geocoder returns both coordinates
        # (the original `is [None, None]` identity test could never be True)
        while lat_lng_coords is None or None in lat_lng_coords:
            # pause to stay under the request rate limit
            sleep(1)
            g = geocoder.arcgis('{}, {}, {}{}'.format(place, city, state, zip_code))
            print(g)
            # a failed request has no usable payload, so try again
            if not g.ok:
                continue
            name = g.raw['name']
            addr_type = g.raw['feature']['attributes']['Addr_Type']
            lat_lng_coords = [g.lat, g.lng]
        print('Address type: {}'.format(addr_type))
        print(lat_lng_coords)
        # label-based assignment avoids pandas chained assignment;
        # it assumes the default 0..n-1 index
        data_frame.loc[i, 'ArcGIS_name'] = name
        data_frame.loc[i, 'ArcGIS_Addr_Type'] = addr_type
        data_frame.loc[i, 'Latitude'] = lat_lng_coords[0]
        data_frame.loc[i, 'Longitude'] = lat_lng_coords[1]
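
The `while` loop in `get_coordinates` keeps requesting until coordinates come back, so it can spin forever on a place the service never resolves. A bounded retry is one possible alternative. In the sketch below, the helper `geocode_with_retry` and the stand-in `flaky_geocoder` are hypothetical and call no real geocoding service.

```python
from time import sleep

def geocode_with_retry(geocode, query, max_attempts=5, pause_seconds=1.0):
    """Call geocode(query) until it returns usable coordinates or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        coords = geocode(query)
        if coords is not None and None not in coords:
            return coords
        if attempt < max_attempts:
            sleep(pause_seconds)  # wait before the next request
    return None  # the caller decides how to handle a place that never resolves

# Stand-in geocoder for demonstration: fails twice, then succeeds.
calls = {'count': 0}

def flaky_geocoder(query):
    calls['count'] += 1
    return [None, None] if calls['count'] < 3 else [39.70396, -105.01039]

result = geocode_with_retry(flaky_geocoder, 'Athmar Park, Denver, Colorado', pause_seconds=0)
print(result, calls['count'])
```

Returning `None` after the last attempt lets the caller collect unresolved places for a second pass with a different provider, which is similar in spirit to the correction steps later in this notebook.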

In [16]:
city = 'Denver'
state = 'Colorado'
column_number = 0 # column containing place names for which to obtain coordinates

get_coordinates(hood_coordinates_df, column_number, city, state)
hood_coordinates_df.head()

Getting coordinates for Athmar Park: 1 of 78
<[OK] Arcgis - Geocode [Athmar Park, Denver, Colorado]>
Address type: Locality
[39.70396000000005, -105.01038999999997]
Getting coordinates for Auraria: 2 of 78
<[OK] Arcgis - Geocode [Auraria, Denver, Colorado]>
Address type: Locality
[39.74577000000005, -105.01001999999994]
Getting coordinates for Baker: 3 of 78
<[OK] Arcgis - Geocode [Baker, Denver, Colorado]>
Address type: Locality
[39.71117000000004, -104.99208999999996]
Getting coordinates for Barnum: 4 of 78
<[OK] Arcgis - Geocode [Barnum, Denver, Colorado]>
Address type: Locality
[39.71815000000004, -105.03308999999996]
Getting coordinates for Barnum West: 5 of 78
<[OK] Arcgis - Geocode [Barnum West, Denver, Colorado]>
Address type: Locality
[39.71815000000004, -105.04509999999999]
Getting coordinates for Bear Valley: 6 of 78
<[OK] Arcgis - Geocode [Bear Valley, Denver, Colorado]>
Address type: Locality
[39.66172000000006, -105.06560999999999]
Getting coordinates for Belcaro: 7 of 78

<[OK] Arcgis - Geocode [Platt Park, Denver, Colorado]>
Address type: Locality
[39.687580000000025, -104.98101999999994]
Getting coordinates for Regis: 52 of 78
<[OK] Arcgis - Geocode [Regis, Denver, Colorado]>
Address type: Locality
[39.787420000000054, -105.04098999999997]
Getting coordinates for Rosedale: 53 of 78
<[OK] Arcgis - Geocode [Rosedale, Denver, Colorado]>
Address type: Locality
[39.67491000000007, -104.98184999999995]
Getting coordinates for Ruby Hill: 54 of 78
<[OK] Arcgis - Geocode [Ruby Hill, Denver, Colorado]>
Address type: Locality
[39.69106000000005, -105.00873999999999]
Getting coordinates for Skyland: 55 of 78
<[OK] Arcgis - Geocode [Skyland, Denver, Colorado]>
Address type: Locality
[39.757580000000075, -104.94989999999996]
Getting coordinates for Sloan Lake: 56 of 78
<[OK] Arcgis - Geocode [Sloan Lake, Denver, Colorado]>
Address type: Locality
[39.752560000000074, -105.03812999999997]
Getting coordinates for South Park Hill: 57 of 78
<[OK] Arcgis - Geocode [South

Unnamed: 0,Neighborhood,ArcGIS_name,ArcGIS_Addr_Type,Latitude,Longitude
0,Athmar Park,"Athmar Park, Denver, Colorado",Locality,39.704,-105.01
1,Auraria,"Auraria, Denver, Colorado",Locality,39.7458,-105.01
2,Baker,"Baker, Denver, Colorado",Locality,39.7112,-104.992
3,Barnum,"Barnum, Denver, Colorado",Locality,39.7182,-105.033
4,Barnum West,"Barnum West, Denver, Colorado",Locality,39.7182,-105.045


In [17]:
# create blank dataframe for corrected neighborhood data
hood_coordinates_corrections_df = pd.DataFrame(columns=['Neighborhood', 'ArcGIS_name', 'ArcGIS_Addr_Type', 'Latitude', 'Longitude'])
hood_coordinates_corrections_df

Unnamed: 0,Neighborhood,ArcGIS_name,ArcGIS_Addr_Type,Latitude,Longitude


In [18]:
# Check for errors: flag results that are not localities or whose name lacks 'Denver'
not_locality = ~hood_coordinates_df['ArcGIS_Addr_Type'].str.contains('Locality', na=False)
not_denver = ~hood_coordinates_df['ArcGIS_name'].str.contains('Denver', na=False)
hood_coordinates_corrections_df['Neighborhood'] = hood_coordinates_df.loc[not_locality | not_denver, 'Neighborhood']
hood_coordinates_corrections_df

Unnamed: 0,Neighborhood,ArcGIS_name,ArcGIS_Addr_Type,Latitude,Longitude
11,Cheesman Park,,,,
22,Denver International Airport,,,,
33,Harvey Park,,,,
37,Indian Creek,,,,
65,University Park,,,,
67,Villa Park,,,,


In [19]:
# concatenate string 'Neighborhood' on rows in 'Neighborhood' column of hood_coordinates_corrections_df
hood_coordinates_corrections_df = hood_coordinates_corrections_df.assign(Neighborhood = hood_coordinates_corrections_df['Neighborhood'] + ' Neighborhood')
hood_coordinates_corrections_df = hood_coordinates_corrections_df.reset_index(drop=True)

In [20]:
get_coordinates(hood_coordinates_corrections_df, 0, 'Denver', 'Colorado')
hood_coordinates_corrections_df.head()

Getting coordinates for Cheesman Park Neighborhood: 1 of 6
<[OK] Arcgis - Geocode [Cheeseman Park, Denver, Colorado]>
Address type: Locality
[39.73664000000008, -104.96720999999997]
Getting coordinates for Denver International Airport Neighborhood: 2 of 6
<[OK] Arcgis - Geocode [Denver International Airport, Denver, Colorado]>
Address type: Locality
[39.85138000000006, -104.68095999999997]
Getting coordinates for Harvey Park Neighborhood: 3 of 6
<[OK] Arcgis - Geocode [Harvey Park South, Denver, Colorado]>
Address type: Locality
[39.66250000000008, -105.04226999999997]
Getting coordinates for Indian Creek Neighborhood: 4 of 6
<[OK] Arcgis - Geocode [Colorado, Utah]>
Address type: Locality
[40.77911000000006, -111.92437999999999]
Getting coordinates for University Park Neighborhood: 5 of 6
<[OK] Arcgis - Geocode [University Park, Fort Collins, Colorado]>
Address type: Locality
[40.578050000000076, -105.07061999999996]
Getting coordinates for Villa Park Neighborhood: 6 of 6
<[OK] Arcgis 

Unnamed: 0,Neighborhood,ArcGIS_name,ArcGIS_Addr_Type,Latitude,Longitude
0,Cheesman Park Neighborhood,"Cheeseman Park, Denver, Colorado",Locality,39.7366,-104.967
1,Denver International Airport Neighborhood,"Denver International Airport, Denver, Colorado",Locality,39.8514,-104.681
2,Harvey Park Neighborhood,"Harvey Park South, Denver, Colorado",Locality,39.6625,-105.042
3,Indian Creek Neighborhood,"Colorado, Utah",Locality,40.7791,-111.924
4,University Park Neighborhood,"University Park, Fort Collins, Colorado",Locality,40.5781,-105.071


In [21]:
hood_coordinates_corrections_df['Neighborhood'] = hood_coordinates_corrections_df['Neighborhood'].str.replace(' Neighborhood','')

In [22]:
corrections_made = []
for i, hood in enumerate(hood_coordinates_df['Neighborhood']):
    for j, corrected in enumerate(hood_coordinates_corrections_df['Neighborhood']):
        if hood == corrected:
            corrections_made.append(i)
            # overwrite the row by position; writing through .values may only modify a copy
            hood_coordinates_df.iloc[i] = hood_coordinates_corrections_df.iloc[j].values
hood_coordinates_df.loc[corrections_made]

Unnamed: 0,Neighborhood,ArcGIS_name,ArcGIS_Addr_Type,Latitude,Longitude
11,Cheesman Park,"Cheeseman Park, Denver, Colorado",Locality,39.7366,-104.967
22,Denver International Airport,"Denver International Airport, Denver, Colorado",Locality,39.8514,-104.681
33,Harvey Park,"Harvey Park South, Denver, Colorado",Locality,39.6625,-105.042
37,Indian Creek,"Colorado, Utah",Locality,40.7791,-111.924
65,University Park,"University Park, Fort Collins, Colorado",Locality,40.5781,-105.071
67,Villa Park,"Park Villas, Aurora, Colorado",Locality,39.6283,-104.826
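
Several of the corrected results above clearly fall outside Denver. A coarse bounding-box test is one way to catch the worst of them automatically. The sketch below assumes rough limits for the City and County of Denver: the numbers in `DENVER_BOUNDS` are approximations chosen for illustration, not an official boundary, and `outside_bounds` is a hypothetical helper.

```python
# Rough bounding box around the City and County of Denver.
# These limits are approximate assumptions, not an official boundary.
DENVER_BOUNDS = {'lat_min': 39.60, 'lat_max': 39.92, 'lng_min': -105.12, 'lng_max': -104.59}

def outside_bounds(lat, lng, bounds=DENVER_BOUNDS):
    """True when the point lies outside the rectangular bounds."""
    inside = (bounds['lat_min'] <= lat <= bounds['lat_max']
              and bounds['lng_min'] <= lng <= bounds['lng_max'])
    return not inside

# Coordinates copied from the corrections table above
results = {
    'Cheesman Park': (39.7366, -104.967),
    'Denver International Airport': (39.8514, -104.681),
    'Harvey Park': (39.6625, -105.042),
    'Indian Creek': (40.7791, -111.924),
    'University Park': (40.5781, -105.071),
    'Villa Park': (39.6283, -104.826),
}
suspect = [hood for hood, (lat, lng) in results.items() if outside_bounds(lat, lng)]
print(suspect)
```

A rectangle only catches gross errors. The Villa Park result, which resolved to Aurora, still falls inside this box, so checking the returned name remains necessary.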


In [23]:
# create blank dataframe for corrected neighborhood data
hood_coordinates_corrections2_df = pd.DataFrame(columns=['Neighborhood', 'name', 'accuracy', 'Latitude', 'Longitude'])
# Check for errors
hood_coordinates_corrections2_df['Neighborhood'] = hood_coordinates_df['Neighborhood'][hood_coordinates_df['ArcGIS_name'].str.contains('Denver') == False]
hood_coordinates_corrections2_df = hood_coordinates_corrections2_df.reset_index(drop=True)
hood_coordinates_corrections2_df

Unnamed: 0,Neighborhood,name,accuracy,Latitude,Longitude
0,Indian Creek,,,,
1,University Park,,,,
2,Villa Park,,,,


In [24]:
# function for coordinate data from GeocodeFarm
def get_coordinates_gf(data_frame, column_number, city, state):
    denominator = data_frame.shape[0]
    for i, place in enumerate(data_frame.iloc[:, column_number]):
        # append the ZIP code to the query when the DataFrame has one
        if 'Zip Code' in data_frame:
            zip_code = data_frame['Zip Code'][i]
        else:
            zip_code = ''
        # print progress
        print('Getting coordinates for %s: %i of %s' % (place, i + 1, denominator))
        # initialize result variables to None
        name = None
        accuracy = None
        lat_lng_coords = None

        # loop until the geocoder returns both coordinates
        # (the original `is [None, None]` identity test could never be True)
        while lat_lng_coords is None or None in lat_lng_coords:
            # pause to stay under the request rate limit
            sleep(1)
            g = geocoder.geocodefarm('{}, {}, {}{}'.format(place, city, state, zip_code))
            print(g)
            # a failed request has no usable payload, so try again
            if not g.ok:
                continue
            name = g.address
            accuracy = g.raw['accuracy']
            lat_lng_coords = [g.lat, g.lng]
        print('Accuracy: {}'.format(accuracy))
        print(lat_lng_coords)
        # label-based assignment avoids pandas chained assignment;
        # it assumes the default 0..n-1 index
        data_frame.loc[i, 'name'] = name
        data_frame.loc[i, 'accuracy'] = accuracy
        data_frame.loc[i, 'Latitude'] = lat_lng_coords[0]
        data_frame.loc[i, 'Longitude'] = lat_lng_coords[1]

In [25]:
get_coordinates_gf(hood_coordinates_corrections2_df, column_number, city, state)
hood_coordinates_corrections2_df

Getting coordinates for Indian Creek: 1 of 3
<[OK] Geocodefarm - Geocode [Indian Creek, CO, United States]>
Accuracy: EXACT_MATCH
[39.684734344465, -104.897361755171]
Getting coordinates for University Park: 2 of 3
<[OK] Geocodefarm - Geocode [University Park, CO, United States]>
Accuracy: EXACT_MATCH
[39.675952911365, -104.950141906171]
Getting coordinates for Villa Park: 3 of 3
<[OK] Geocodefarm - Geocode [Vicca Park, CO, United States]>
Accuracy: EXACT_MATCH
[39.731399536165, -105.039489746172]


Unnamed: 0,Neighborhood,name,accuracy,Latitude,Longitude
0,Indian Creek,"Indian Creek, CO, United States",EXACT_MATCH,39.6847,-104.897
1,University Park,"University Park, CO, United States",EXACT_MATCH,39.676,-104.95
2,Villa Park,"Vicca Park, CO, United States",EXACT_MATCH,39.7314,-105.039
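
An `EXACT_MATCH` accuracy does not guarantee that the returned name agrees with the requested one, as the third row above shows. The sketch below flags such rows for manual review with a simple prefix comparison; `name_matches` is a hypothetical helper, and the strictness of the comparison is an assumption.

```python
def name_matches(requested, returned):
    """True when the geocoder's returned name begins with the requested place name."""
    return returned.strip().lower().startswith(requested.strip().lower())

# (requested neighborhood, name returned by the geocoder), copied from the table above
pairs = [
    ('Indian Creek', 'Indian Creek, CO, United States'),
    ('University Park', 'University Park, CO, United States'),
    ('Villa Park', 'Vicca Park, CO, United States'),
]
needs_review = [requested for requested, returned in pairs
                if not name_matches(requested, returned)]
print(needs_review)
```

A flagged row is not necessarily wrong. It only means the coordinates deserve a manual look, for example on a map.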


In [26]:
corrections_made = []
for i, hood in enumerate(hood_coordinates_df['Neighborhood']):
    for j, corrected in enumerate(hood_coordinates_corrections2_df['Neighborhood']):
        if hood == corrected:
            corrections_made.append(i)
            # overwrite the row by position; writing through .values may only modify a copy
            hood_coordinates_df.iloc[i] = hood_coordinates_corrections2_df.iloc[j].values
hood_coordinates_df.loc[corrections_made]

Unnamed: 0,Neighborhood,ArcGIS_name,ArcGIS_Addr_Type,Latitude,Longitude
37,Indian Creek,"Indian Creek, CO, United States",EXACT_MATCH,39.6847,-104.897
65,University Park,"University Park, CO, United States",EXACT_MATCH,39.676,-104.95
67,Villa Park,"Vicca Park, CO, United States",EXACT_MATCH,39.7314,-105.039


In [27]:
hood_coordinates_df = pd.DataFrame(hood_coordinates_df[['Neighborhood', 'Latitude', 'Longitude']])
hood_coordinates_df

Unnamed: 0,Neighborhood,Latitude,Longitude
0,Athmar Park,39.704,-105.01
1,Auraria,39.7458,-105.01
2,Baker,39.7112,-104.992
3,Barnum,39.7182,-105.033
4,Barnum West,39.7182,-105.045
5,Bear Valley,39.6617,-105.066
6,Belcaro,39.7038,-104.95
7,Berkeley,39.7767,-105.04
8,Central Business District,39.7437,-104.991
9,Capitol Hill,39.7337,-104.979


##### Save Data as CSV File

In [28]:
hood_coordinates_df.to_csv('denver_neighborhood_coords.csv', index=False, mode='w')
print('Export complete.')

Export complete.


#### ArcGIS Coordinates for Marijuana Active Business Licenses

In [29]:
# create blank dataframe for business data
cannabis_coordinates_df = pd.DataFrame(columns=['Business File Number', 'Address', 'Zip Code', 'Latitude', 'Longitude', 'ArcGIS_name', 'ArcGIS_Addr_Type']).astype('object')
cannabis_coordinates_df

Unnamed: 0,Business File Number,Address,Zip Code,Latitude,Longitude,ArcGIS_name,ArcGIS_Addr_Type


In [30]:
# add business data to dataframe
cannabis_coordinates_df['Business File Number'] = denver_marijuana_businesses_df['Business File Number']
cannabis_coordinates_df['Address'] = denver_marijuana_businesses_df['Facility Street Number'].astype(int).astype(str) + ' ' + denver_marijuana_businesses_df['Facility Pre-Direction'] + ' ' + denver_marijuana_businesses_df['Facility Street Name'] + ' ' + denver_marijuana_businesses_df['Facility Street Type']
cannabis_coordinates_df['Zip Code'] = denver_marijuana_businesses_df['Facility Zip Code']
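
The concatenation above adds a space for every address field, including empty ones. That is why the geocoding log further down shows queries such as `910  16TH ST` (two spaces) and `432 S BROADWAY ` (trailing space). The sketch below shows one way to join only the non-empty parts; `build_address` is a hypothetical helper and is not used elsewhere in this notebook.

```python
def build_address(number, pre_direction, street_name, street_type):
    """Join the non-empty address parts with single spaces."""
    parts = [str(number), pre_direction, street_name, street_type]
    return ' '.join(part.strip() for part in parts if part and part.strip())

print(build_address(7870, 'W', 'QUINCY', 'AVE'))  # 7870 W QUINCY AVE
print(build_address(910, '', '16TH', 'ST'))       # 910 16TH ST
print(build_address(432, 'S', 'BROADWAY', ''))    # 432 S BROADWAY
```

Applied row by row, for example with `DataFrame.apply(..., axis=1)`, a helper like this would produce cleaner query strings, although the ArcGIS results shown below suggest the geocoder tolerates the extra spaces.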


In [31]:
# verify contents of DataFrame and shape
print('Rows: %i\nColumns: %i' % (cannabis_coordinates_df.shape[0], cannabis_coordinates_df.shape[1]))
# reset_index returns a new DataFrame, so the result must be assigned back
cannabis_coordinates_df = cannabis_coordinates_df.reset_index(drop=True)
cannabis_coordinates_df.head()

Rows: 351
Columns: 7


Unnamed: 0,Business File Number,Address,Zip Code,Latitude,Longitude,ArcGIS_name,ArcGIS_Addr_Type
0,2014-BFN-0003949,7870 W QUINCY AVE,80123,,,,
1,2013-BFN-1068512,7870 W QUINCY AVE,80123,,,,
2,2016-BFN-0004094,4550 S KIPLING ST,80127,,,,
3,2014-BFN-1071718,4550 S KIPLING ST,80127,,,,
4,2010-BFN-1045692,910 16TH ST,80202,,,,


In [32]:
# retrieve coordinate data from ArcGIS
city = 'Denver'
state = 'Colorado'
column_number = 1
data_frame = cannabis_coordinates_df

get_coordinates(data_frame, column_number, city, state)
data_frame.head()

Getting coordinates for 7870 W QUINCY AVE: 1 of 351
<[OK] Arcgis - Geocode [7870 W Quincy Ave, Littleton, Colorado, 80123]>
Address type: PointAddress
[39.638685470136956, -105.084801]
Getting coordinates for 7870 W QUINCY AVE: 2 of 351
<[OK] Arcgis - Geocode [7870 W Quincy Ave, Littleton, Colorado, 80123]>
Address type: PointAddress
[39.638685470136956, -105.084801]
Getting coordinates for 4550 S KIPLING ST: 3 of 351
<[OK] Arcgis - Geocode [4550 S Kipling St, Littleton, Colorado, 80127]>
Address type: PointAddress
[39.63370499999999, -105.10977829078479]
Getting coordinates for 4550 S KIPLING ST: 4 of 351
<[OK] Arcgis - Geocode [4550 S Kipling St, Littleton, Colorado, 80127]>
Address type: PointAddress
[39.63370499999999, -105.10977829078479]
Getting coordinates for 910  16TH ST: 5 of 351
<[OK] Arcgis - Geocode [910 16th St, Denver, Colorado, 80202]>
Address type: PointAddress
[39.74671522424844, -104.99422949945173]
Getting coordinates for 1444  WAZEE ST: 6 of 351
<[OK] Arcgis - Geoc

<[OK] Arcgis - Geocode [762 Kalamath St, Denver, Colorado, 80204]>
Address type: StreetAddress
[39.72809337623714, -105.00018039060937]
Getting coordinates for 1630 N FEDERAL BLVD: 47 of 351
<[OK] Arcgis - Geocode [1630 Federal Blvd, Denver, Colorado, 80204]>
Address type: PointAddress
[39.742874413266776, -105.02521430694227]
Getting coordinates for 990 W 6TH AVE: 48 of 351
<[OK] Arcgis - Geocode [990 W 6th Ave, Denver, Colorado, 80204]>
Address type: StreetAddress
[39.72561068926006, -105.00001153785202]
Getting coordinates for 419 W 13TH AVE: 49 of 351
<[OK] Arcgis - Geocode [419 W 13th Ave, Denver, Colorado, 80204]>
Address type: PointAddress
[39.73688746378898, -104.99343258090484]
Getting coordinates for 445 N FEDERAL BLVD: 50 of 351
<[OK] Arcgis - Geocode [445 Federal Blvd, Denver, Colorado, 80204]>
Address type: PointAddress
[39.72360611518263, -105.02521618526744]
Getting coordinates for 777 N CANOSA CT: 51 of 351
<[OK] Arcgis - Geocode [777 Canosa Ct, Denver, Colorado, 80204]

<[OK] Arcgis - Geocode [3955 Oneida St, Denver, Colorado, 80207]>
Address type: PointAddress
[39.771989276992315, -104.90771726576824]
Getting coordinates for 3835 N Elm ST: 91 of 351
<[OK] Arcgis - Geocode [3835 Elm St, Denver, Colorado, 80207]>
Address type: PointAddress
[39.770027999999996, -104.92935871859294]
Getting coordinates for 3950 N HOLLY ST: 92 of 351
<[OK] Arcgis - Geocode [3950 Holly St, Denver, Colorado, 80207]>
Address type: PointAddress
[39.772224723007696, -104.92228420869652]
Getting coordinates for 399 S HARRISON ST: 93 of 351
<[OK] Arcgis - Geocode [399 S Harrison St, Denver, Colorado, 80209]>
Address type: PointAddress
[39.70943027699232, -104.94137776245003]
Getting coordinates for 432 S BROADWAY : 94 of 351
<[OK] Arcgis - Geocode [432 S Broadway, Denver, Colorado, 80209]>
Address type: PointAddress
[39.70869300000001, -104.98757574534025]
Getting coordinates for 135 S BROADWAY : 95 of 351
<[OK] Arcgis - Geocode [135 S Broadway, Denver, Colorado, 80209]>
Address

<[OK] Arcgis - Geocode [1881 S Broadway, Denver, Colorado, 80210]>
Address type: PointAddress
[39.682458, -104.9876167146978]
Getting coordinates for 1724 S BROADWAY : 135 of 351
<[OK] Arcgis - Geocode [1724 S Broadway, Denver, Colorado, 80210]>
Address type: PointAddress
[39.685175580904854, -104.98750777029855]
Getting coordinates for 1881 S BROADWAY : 136 of 351
<[OK] Arcgis - Geocode [1881 S Broadway, Denver, Colorado, 80210]>
Address type: PointAddress
[39.682458, -104.9876167146978]
Getting coordinates for 2209 W 32ND AVE: 137 of 351
<[OK] Arcgis - Geocode [2209 W 32nd Ave, Denver, Colorado, 80211]>
Address type: PointAddress
[39.762037486005966, -105.013152]
Getting coordinates for 2707 W 38TH AVE: 138 of 351
<[OK] Arcgis - Geocode [2707 W 38th Ave, Denver, Colorado, 80211]>
Address type: PointAddress
[39.769295489324186, -105.02084199708581]
Getting coordinates for 2209 W 32ND AVE: 139 of 351
<[OK] Arcgis - Geocode [2209 W 32nd Ave, Denver, Colorado, 80211]>
Address type: Point

<[OK] Arcgis - Geocode [4935 York St, Denver, Colorado, 80216]>
Address type: PointAddress
[39.7855262514571, -104.95933825855488]
Getting coordinates for 3450 N BRIGHTON BLVD: 179 of 351
<[OK] Arcgis - Geocode [3450 Brighton Blvd, Denver, Colorado, 80216]>
Address type: PointAddress
[39.77018334802941, -104.97900419412558]
Getting coordinates for 4401 E 46TH AVE: 180 of 351
<[OK] Arcgis - Geocode [4401 E 46th Ave, Denver, Colorado, 80216]>
Address type: StreetAddress
[39.780542369628364, -104.93519613037165]
Getting coordinates for 5110 N RACE ST: 181 of 351
<[OK] Arcgis - Geocode [5110 Race St, Denver, Colorado, 80216]>
Address type: PointAddress
[39.78828899999999, -104.96347627751192]
Getting coordinates for 4501 N ADAMS ST: 182 of 351
<[OK] Arcgis - Geocode [4501 Adams St, Denver, Colorado, 80216]>
Address type: PointAddress
[39.77904642936695, -104.94864899267107]
Getting coordinates for 4095 N JACKSON ST: 183 of 351
<[OK] Arcgis - Geocode [4095 Jackson St, Denver, Colorado, 8021

<[OK] Arcgis - Geocode [1568 S Federal Blvd, Denver, Colorado, 80219]>
Address type: PointAddress
[39.68832536081135, -105.02503679857507]
Getting coordinates for 5109 W ALAMEDA AVE: 223 of 351
<[OK] Arcgis - Geocode [5109 W Alameda Ave, Denver, Colorado, 80219]>
Address type: PointAddress
[39.71141449711446, -105.052401]
Getting coordinates for 330 N FEDERAL BLVD: 224 of 351
<[OK] Arcgis - Geocode [330 Federal Blvd, Denver, Colorado, 80219]>
Address type: PointAddress
[39.72170700000001, -105.02503478472528]
Getting coordinates for 2601 W ALAMEDA AVE: 225 of 351
<[OK] Arcgis - Geocode [2601 W Alameda Ave, Denver, Colorado, 80219]>
Address type: PointAddress
[39.71120347489747, -105.01821]
Getting coordinates for 755 S Federal BLVD: 226 of 351
<[OK] Arcgis - Geocode [755 S Federal Blvd, Denver, Colorado, 80219]>
Address type: PointAddress
[39.70285200000001, -105.02511925855488]
Getting coordinates for 2426 S FEDERAL BLVD: 227 of 351
<[OK] Arcgis - Geocode [2426 S Federal Blvd, Denver,

<[OK] Arcgis - Geocode [150 Rio Grande Blvd, Denver, Colorado, 80223]>
Address type: PointAddress
[39.718809375814985, -105.00403645389255]
Getting coordinates for 2042 S BANNOCK ST: 267 of 351
<[OK] Arcgis - Geocode [2042 S Bannock St, Denver, Colorado, 80223]>
Address type: PointAddress
[39.679406832361934, -104.98990678862042]
Getting coordinates for 930 W BYERS PL: 268 of 351
<[OK] Arcgis - Geocode [930 W Byers Pl, Denver, Colorado, 80223]>
Address type: PointAddress
[39.711976293900555, -104.99843334442272]
Getting coordinates for 1941 W Evans : 269 of 351
<[OK] Arcgis - Geocode [1941 W Evans Ave, Denver, Colorado, 80223]>
Address type: PointAddress
[39.678672497114476, -105.0103797485429]
Getting coordinates for 1178 S KALAMATH ST: 270 of 351
<[OK] Arcgis - Geocode [1178 S Kalamath St, Denver, Colorado, 80223]>
Address type: PointAddress
[39.69531883236195, -105.00028532194592]
Getting coordinates for 2490 W 2ND AVE: 271 of 351
<[OK] Arcgis - Geocode [2490 W 2nd Ave, Denver, Colo

<[OK] Arcgis - Geocode [970 S Oneida St, Denver, Colorado, 80224]>
Address type: PointAddress
[39.69888296863638, -104.90817779136172]
Getting coordinates for 9206 E HAMPDEN AVE: 311 of 351
<[OK] Arcgis - Geocode [9206 E Hampden Ave, Denver, Colorado, 80231]>
Address type: StreetAddress
[39.652980767084316, -104.88046013654566]
Getting coordinates for 3435 S YOSEMITE ST: 312 of 351
<[OK] Arcgis - Geocode [3435 S Yosemite St, Denver, Colorado, 80231]>
Address type: PointAddress
[39.65435969608748, -104.88488961587512]
Getting coordinates for 9206 E HAMPDEN AVE: 313 of 351
<[OK] Arcgis - Geocode [9206 E Hampden Ave, Denver, Colorado, 80231]>
Address type: StreetAddress
[39.652980767084316, -104.88046013654566]
Getting coordinates for 3435 S YOSEMITE ST: 314 of 351
<[OK] Arcgis - Geocode [3435 S Yosemite St, Denver, Colorado, 80231]>
Address type: PointAddress
[39.65435969608748, -104.88488961587512]
Getting coordinates for 3480 S GALENA ST: 315 of 351
<[OK] Arcgis - Geocode [3480 S Galen

Unnamed: 0,Business File Number,Address,Zip Code,Latitude,Longitude,ArcGIS_name,ArcGIS_Addr_Type
0,2014-BFN-0003949,7870 W QUINCY AVE,80123,39.6387,-105.085,"7870 W Quincy Ave, Littleton, Colorado, 80123",PointAddress
1,2013-BFN-1068512,7870 W QUINCY AVE,80123,39.6387,-105.085,"7870 W Quincy Ave, Littleton, Colorado, 80123",PointAddress
2,2016-BFN-0004094,4550 S KIPLING ST,80127,39.6337,-105.11,"4550 S Kipling St, Littleton, Colorado, 80127",PointAddress
3,2014-BFN-1071718,4550 S KIPLING ST,80127,39.6337,-105.11,"4550 S Kipling St, Littleton, Colorado, 80127",PointAddress
4,2010-BFN-1045692,910 16TH ST,80202,39.7467,-104.994,"910 16th St, Denver, Colorado, 80202",PointAddress


In [34]:
# verify data integrity: count rows geocoded at the address-point or street level
acceptable_types = ['PointAddress', 'StreetAddress']
valid_shape = cannabis_coordinates_df[cannabis_coordinates_df['ArcGIS_Addr_Type'].isin(acceptable_types)].shape
print('Rows: %i\nColumns: %i' % (valid_shape[0], valid_shape[1]))
cannabis_coordinates_df.head()

Rows: 350
Columns: 7


Unnamed: 0,Business File Number,Address,Zip Code,Latitude,Longitude,ArcGIS_name,ArcGIS_Addr_Type
0,2014-BFN-0003949,7870 W QUINCY AVE,80123,39.6387,-105.085,"7870 W Quincy Ave, Littleton, Colorado, 80123",PointAddress
1,2013-BFN-1068512,7870 W QUINCY AVE,80123,39.6387,-105.085,"7870 W Quincy Ave, Littleton, Colorado, 80123",PointAddress
2,2016-BFN-0004094,4550 S KIPLING ST,80127,39.6337,-105.11,"4550 S Kipling St, Littleton, Colorado, 80127",PointAddress
3,2014-BFN-1071718,4550 S KIPLING ST,80127,39.6337,-105.11,"4550 S Kipling St, Littleton, Colorado, 80127",PointAddress
4,2010-BFN-1045692,910 16TH ST,80202,39.7467,-104.994,"910 16th St, Denver, Colorado, 80202",PointAddress


In [35]:
# remove unneeded columns
cannabis_coordinates_df = cannabis_coordinates_df.drop(columns=['Address', 'Zip Code', 'ArcGIS_name', 'ArcGIS_Addr_Type'])
cannabis_coordinates_df.head()

Unnamed: 0,Business File Number,Latitude,Longitude
0,2014-BFN-0003949,39.6387,-105.085
1,2013-BFN-1068512,39.6387,-105.085
2,2016-BFN-0004094,39.6337,-105.11
3,2014-BFN-1071718,39.6337,-105.11
4,2010-BFN-1045692,39.7467,-104.994


In [36]:
# export data as CSV
cannabis_coordinates_df.to_csv('denver_cannabis_coords.csv', index=False, mode='w')
print('Export complete.')

Export complete.


### Obtain List of Venues for Each Neighborhood

#### Import the Needed Libraries

In [37]:
import time # used to insert a pause to prevent timeout from request rate
import requests # library to handle requests

#### Set Foursquare API Client Credentials & Parameters
Credentials for calling the Foursquare API are private, so the variables CLIENT_ID and CLIENT_SECRET are shown as 'XXX' below; the actual calls were made with both variables set to the correct API credentials.

In [161]:
CLIENT_ID = 'XXX' # your Foursquare ID
CLIENT_SECRET = 'XXX' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

LIMIT = 100 # maximum number of venues to return
RADIUS = 1609 # ~ 1 mile

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET: ' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: XXX
CLIENT_SECRET: XXX


#### Create a Function to Request Nearby Venues
Nearby venues are found within a radius of approximately 1 mile (1,609 m) of each neighborhood's coordinate point.

In [39]:
def getNearbyVenues(names, latitudes, longitudes, radius=RADIUS):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        time.sleep(1)
        print("Getting venues for {}".format(name))
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']

        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
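
The function above builds the request URL by string formatting and assumes every venue carries at least one category. The following is a minimal sketch of a more defensive variant, assuming the same `venues/explore` response shape that the function already relies on. The helper names `build_explore_params` and `parse_explore_items` and the sample payload are hypothetical and used only for illustration.

```python
def build_explore_params(client_id, client_secret, version, lat, lng, radius, limit):
    """Return query parameters suitable for requests.get(url, params=...)."""
    return {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
    }


def parse_explore_items(payload, name, lat, lng):
    """Flatten one decoded explore response into rows, tolerating missing parts."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories', [])
        # a venue without categories gets None instead of raising IndexError
        category = categories[0]['name'] if categories else None
        rows.append((name, lat, lng, venue['name'],
                     venue['location']['lat'], venue['location']['lng'], category))
    return rows


# Made-up payload for illustration only (not real Foursquare data).
sample_payload = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Example Cafe',
               'location': {'lat': 39.701, 'lng': -105.001},
               'categories': [{'name': 'Coffee Shop'}]}},
    {'venue': {'name': 'Uncategorized Place',
               'location': {'lat': 39.702, 'lng': -105.002},
               'categories': []}},
]}]}}

rows = parse_explore_items(sample_payload, 'Example Hood', 39.7, -105.0)

# In the notebook, the payload would come from a call such as:
#   requests.get('https://api.foursquare.com/v2/venues/explore',
#                params=build_explore_params(...), timeout=10).json()
```

Passing a `params` dictionary lets `requests` handle URL encoding, and parsing separately from the network call makes the row-building logic testable without API credentials.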

In [40]:
denver_venues = getNearbyVenues(names=hood_coordinates_df['Neighborhood'],
                                   latitudes=hood_coordinates_df['Latitude'],
                                   longitudes=hood_coordinates_df['Longitude']
                                  )

Getting venues for Athmar Park
Getting venues for Auraria
Getting venues for Baker
Getting venues for Barnum
Getting venues for Barnum West
Getting venues for Bear Valley
Getting venues for Belcaro
Getting venues for Berkeley
Getting venues for Central Business District
Getting venues for Capitol Hill
Getting venues for Chaffee Park
Getting venues for Cheesman Park
Getting venues for Cherry Creek
Getting venues for City Park
Getting venues for City Park West
Getting venues for Civic Center
Getting venues for Clayton
Getting venues for Cole
Getting venues for College View - South Platte
Getting venues for Congress Park
Getting venues for Cory - Merrill
Getting venues for Country Club
Getting venues for Denver International Airport
Getting venues for East Colfax
Getting venues for Elyria Swansea
Getting venues for Five Points
Getting venues for Fort Logan
Getting venues for Gateway - Green Valley Ranch
Getting venues for Globeville
Getting venues for Goldsmith
Getting venues for Hale
Get

In [41]:
print(denver_venues.shape)
denver_venues

(5696, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Athmar Park,39.70396,-105.01039,Super Star Asian Cuisine,39.710007,-105.013715,Dim Sum Restaurant
1,Athmar Park,39.70396,-105.01039,Vinh Xuong Bakery (2),39.710609,-105.015232,Vietnamese Restaurant
2,Athmar Park,39.70396,-105.01039,Chain Reaction Brewery,39.699577,-105.001335,Brewery
3,Athmar Park,39.70396,-105.01039,The Green Solution - Alameda Ave @ West Denver...,39.711530,-105.018290,Marijuana Dispensary
4,Athmar Park,39.70396,-105.01039,Costco Wholesale,39.708594,-105.014280,Department Store
5,Athmar Park,39.70396,-105.01039,New Saigon,39.704848,-105.024811,Vietnamese Restaurant
6,Athmar Park,39.70396,-105.01039,Pho Duy,39.699758,-105.025382,Vietnamese Restaurant
7,Athmar Park,39.70396,-105.01039,Stranahan's Colorado Whiskey,39.712253,-104.998576,Distillery
8,Athmar Park,39.70396,-105.01039,Pacific Ocean International Supermarket,39.710408,-105.013612,Supermarket
9,Athmar Park,39.70396,-105.01039,Level 7 Games,39.711467,-105.015581,Video Game Store


In [42]:
denver_venues.to_csv('denver_neighborhood_venues.csv', index=False, mode='w')
print('Export complete.')

Export complete.


#### Create a Function to Return Foursquare Categorical Hierarchy
The hierarchy is reduced to two levels: top_category and child_category. This data will be used in a future iteration of the study.

In [43]:
def getCategoriesHierarchy():
    cat_list=[]
    url = 'https://api.foursquare.com/v2/venues/categories?&client_id={}&client_secret={}&v={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            )
    # make the GET request
    results = requests.get(url).json()['response']
    for i, top_h in enumerate(results['categories']):
        h1 = top_h['name']
        if len(results['categories'][i]['categories']) > 0:
            cat_list.append([(h1, h1)])
            for j, second_h in enumerate(results['categories'][i]['categories']):
                h2 = second_h['name']
                if len(results['categories'][i]['categories'][j]['categories']) > 0:
                    cat_list.append([(h1, h2)])
                    for k, third_h in enumerate(results['categories'][i]['categories'][j]['categories']):
                        h3 = third_h['name']
                        if len(results['categories'][i]['categories'][j]['categories'][k]['categories']):
                            cat_list.append([(h1, h3)])
                            for l, fourth_h in enumerate(results['categories'][i]['categories'][j]['categories'][k]['categories']):
                                h4 = fourth_h['name']
                                if len(results['categories'][i]['categories'][j]['categories'][k]['categories'][l]['categories']) > 0:
                                    cat_list.append([(h1, h4)])
                                    for m, fifth_h in enumerate(results['categories'][i]['categories'][j]['categories'][k]['categories'][l]['categories']):
                                        h5 = fifth_h['name']
                                        cat_list.append([(h1, h5)])
                                        if len(results['categories'][i]['categories'][j]['categories'][k]['categories'][l]['categories'][m]['categories']) > 0:
                                            for n, sixth_h in enumerate(results['categories'][i]['categories'][j]['categories'][k]['categories'][l]['categories'][m]['categories']):
                                                h6 = sixth_h['name']
                                                cat_list.append([(h1, h6)])
                                        else:
                                            cat_list.append([(h1, h5)])
                                else:
                                    cat_list.append([(h1, h4)])
                        else:
                            cat_list.append([(h1, h3)])
                else:
                    cat_list.append([(h1, h2)])
        else:
            cat_list.append([(h1, h1)])
    # each entry of cat_list is a one-element list holding a (top, child) tuple; flatten to rows
    cat_data_frame = pd.DataFrame([pair for group in cat_list for pair in group])
    cat_data_frame.columns = ['top_category', 'child_category']
    return cat_data_frame

In [44]:
fs_cat_df = getCategoriesHierarchy()
fs_cat_df.shape

(950, 2)
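
The nested loops above walk the category tree one level at a time. As a hedged alternative, the same two-level reduction can be sketched with a short recursive generator, which handles any depth. The function name `flatten_categories` and the sample tree are hypothetical. Unlike the original function, this sketch emits each (top, child) pair once, so the row count on real data would differ from the 950 reported above.

```python
def flatten_categories(categories, top_name=None):
    """Yield (top_category, child_category) pairs for a nested category list."""
    for category in categories:
        # at the top level a category is its own top_category
        top = top_name if top_name is not None else category['name']
        yield (top, category['name'])
        # recurse into sub-categories, carrying the top-level name down
        yield from flatten_categories(category.get('categories', []), top)


# Made-up tree mimicking the nested 'categories' structure (illustration only).
sample_tree = [
    {'name': 'Food', 'categories': [
        {'name': 'Asian Restaurant', 'categories': [
            {'name': 'Dim Sum Restaurant', 'categories': []},
        ]},
        {'name': 'Bakery', 'categories': []},
    ]},
    {'name': 'Nightlife Spot', 'categories': [
        {'name': 'Brewery', 'categories': []},
    ]},
]

pairs = list(flatten_categories(sample_tree))
```

The resulting list of tuples can be passed to `pd.DataFrame(pairs, columns=['top_category', 'child_category'])` to obtain the same two-column layout.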

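#### Add Venue Categories Highest Level of Hierarchy as Column
Mapping each venue's category to the top level of the Foursquare hierarchy reduces ambiguity about the type of venue. It also lowers the entropy of the category feature by collapsing many specific categories into a few broad ones.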

In [45]:
denver_venues['Venue Top Category'] = denver_venues['Venue Category']
denver_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category,Venue Top Category
0,Athmar Park,39.70396,-105.01039,Super Star Asian Cuisine,39.710007,-105.013715,Dim Sum Restaurant,Dim Sum Restaurant
1,Athmar Park,39.70396,-105.01039,Vinh Xuong Bakery (2),39.710609,-105.015232,Vietnamese Restaurant,Vietnamese Restaurant
2,Athmar Park,39.70396,-105.01039,Chain Reaction Brewery,39.699577,-105.001335,Brewery,Brewery
3,Athmar Park,39.70396,-105.01039,The Green Solution - Alameda Ave @ West Denver...,39.71153,-105.01829,Marijuana Dispensary,Marijuana Dispensary
4,Athmar Park,39.70396,-105.01039,Costco Wholesale,39.708594,-105.01428,Department Store,Department Store


In [46]:
for i, cat in enumerate(fs_cat_df['child_category']):
    denver_venues['Venue Top Category'].replace(to_replace=cat, value=fs_cat_df['top_category'][i], inplace=True)
denver_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category,Venue Top Category
0,Athmar Park,39.70396,-105.01039,Super Star Asian Cuisine,39.710007,-105.013715,Dim Sum Restaurant,Food
1,Athmar Park,39.70396,-105.01039,Vinh Xuong Bakery (2),39.710609,-105.015232,Vietnamese Restaurant,Food
2,Athmar Park,39.70396,-105.01039,Chain Reaction Brewery,39.699577,-105.001335,Brewery,Nightlife Spot
3,Athmar Park,39.70396,-105.01039,The Green Solution - Alameda Ave @ West Denver...,39.71153,-105.01829,Marijuana Dispensary,Shop & Service
4,Athmar Park,39.70396,-105.01039,Costco Wholesale,39.708594,-105.01428,Department Store,Shop & Service
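
The loop above calls `replace` once per category row. A minimal sketch of an equivalent single-pass lookup is shown here, assuming each child category belongs to one top category. The small DataFrames `fs_cat_example` and `venues_example` are made up for illustration and stand in for `fs_cat_df` and `denver_venues`.

```python
import pandas as pd

# Stand-in for fs_cat_df (illustration only).
fs_cat_example = pd.DataFrame({
    'top_category': ['Food', 'Food', 'Nightlife Spot', 'Shop & Service'],
    'child_category': ['Food', 'Dim Sum Restaurant', 'Brewery', 'Marijuana Dispensary'],
})

# Stand-in for denver_venues (illustration only).
venues_example = pd.DataFrame({
    'Venue Category': ['Dim Sum Restaurant', 'Brewery',
                       'Marijuana Dispensary', 'Unknown Category'],
})

# Build a child -> top lookup; duplicates are dropped so the index is unique.
child_to_top = (fs_cat_example
                .drop_duplicates(subset='child_category')
                .set_index('child_category')['top_category'])

# Map every venue in one pass; categories missing from the lookup keep their original value.
venues_example['Venue Top Category'] = (
    venues_example['Venue Category']
    .map(child_to_top)
    .fillna(venues_example['Venue Category'])
)
```

Keeping unmatched categories unchanged mirrors the behavior of `replace`, which leaves values alone when no match is found.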


In [47]:
denver_venues.to_csv('denver_neighborhood_venues_top.csv', index=False, mode='w')
print('Export complete.')

Export complete.


## Data Understanding
"Descriptive statistics and visualization techniques can help a data scientist understand data content, assess data quality and discover initial insights into the data. A revisiting of the previous step, data collection, might be necessary to close gaps in understanding."<sup>[1]</sup>

### Visualize ZIP Codes in the City and County of Denver
ZIP Code data was considered as an alternative to neighborhood data due to aforementioned issues. An example of a multi-jurisdictional ZIP Code that could taint data is 80022, which covers a very small portion of Denver's far northeast. A depiction of the various ZIP Codes that are multi-jurisdictional in Metro Denver is useful in understanding this issue. Denver's border is depicted with the bold black lines surrounding the bold-font Denver label. The following image was retrieved from ZipMap.net:  
<center>
![Map of Denver Metro Zip Codes](denver_metro_zip_map.png "Map of Metro Denver Zip Codes")

(https://www.zipmap.net/Colorado/Denver_County/Denver.htm, retrieved 12/20/2018)
</center>

### Visualize Neighborhoods in the City and County of Denver

#### Prepare Neighborhood Data for Map Visualization

In [88]:
# import csv data to pandas DataFrame
hood_coordinates_df = pd.read_csv('denver_neighborhood_coords.csv', header=0, sep=',').sort_values(by=['Neighborhood']).reset_index()
# verify dataframe contents and shape
print('Rows: %i\nColumns: %i' % (hood_coordinates_df.shape[0], hood_coordinates_df.shape[1]))
print(hood_coordinates_df.head())

Rows: 78
Columns: 4
   index Neighborhood  Latitude  Longitude
0      0  Athmar Park  39.70396 -105.01039
1      1      Auraria  39.74577 -105.01002
2      2        Baker  39.71117 -104.99209
3      3       Barnum  39.71815 -105.03309
4      4  Barnum West  39.71815 -105.04510


#### Create Map

In [89]:
# import Nominatim from geopy Library

In [90]:
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

In [91]:
# function for getting the coordinates of a city
def getCityCoordinates(city, state):
    address = '{}, {}'.format(city, state)
    geolocator = Nominatim(user_agent="battle-of-neighborhoods-project")
    location = geolocator.geocode(address)
    latitude = location.latitude
    longitude = location.longitude
    return(latitude, longitude)

In [92]:
city = 'Denver'
state = 'CO'
latitude, longitude = getCityCoordinates(city, state)
print('The geographical coordinates of Denver are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Denver are 39.7392364, -104.9848623.


In [93]:
# import folium library
import folium

In [94]:
# create map of Denver using latitude and longitude values
map_denver = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, neighborhood in zip(hood_coordinates_df['Latitude'], hood_coordinates_df['Longitude'], hood_coordinates_df['Neighborhood']):
    label = 'Neighborhood: {}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_denver)  
    
map_denver

#### Map Interpretation
It is important that each neighborhood is plotted at the correct location. The City and County of Denver does not have precise coordinate definitions available for each neighborhood, so data from ArcGIS and GeocodeFarm is used for each neighborhood center. A visual comparison of the map above with the neighborhood map available on the City and County of Denver website verifies that the coordinates are accurate for each neighborhood.

### Visualize Marijuana Active Business Licenses in the City and County of Denver

#### Prepare Marijuana Active Business Licenses for Map Visualization

In [95]:
# import csv data to pandas DataFrame
cannabis_cords_df = pd.read_csv('denver_cannabis_coords.csv', header=0, sep=',').reset_index()
# verify dataframe contents and shape
print('Rows: %i\nColumns: %i' % (cannabis_cords_df.shape[0], cannabis_cords_df.shape[1]))
cannabis_cords_df.head()

Rows: 351
Columns: 4


Unnamed: 0,index,Business File Number,Latitude,Longitude
0,0,2014-BFN-0003949,39.638685,-105.084801
1,1,2013-BFN-1068512,39.638685,-105.084801
2,2,2016-BFN-0004094,39.633705,-105.109778
3,3,2014-BFN-1071718,39.633705,-105.109778
4,4,2010-BFN-1045692,39.746715,-104.994229


#### Create Map

In [96]:
# create map of Denver using latitude and longitude values
map_denver = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, store in zip(cannabis_cords_df['Latitude'], cannabis_cords_df['Longitude'], cannabis_cords_df['Business File Number']):
    label = 'Location: {}'.format(store)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_denver)  
    
map_denver

#### Map Interpretation
Retail Marijuana Stores and Medical Marijuana Centers appear to be concentrated along main thoroughfares, especially Federal Boulevard (US 287), Colfax Avenue (US 40), Broadway, and Interstate 25 (US 6). Although the stores and centers are therefore not evenly distributed throughout the city, these routes are adjacent to numerous neighborhoods, and at least one of them is likely used by many, if not most, of the city's residents and visitors on a given day. Most areas of the city have a Retail Marijuana Store or Medical Marijuana Center within walking distance or a short drive.

## Data Preparation
"The data preparation stage comprises all activities used to construct the data set that will be used in the modeling stage. These include data cleaning, combining data from multiple sources and transforming data into more useful variables. Moreover, feature engineering and text analytics may be used to derive new structured variables, enriching the set of predictors and enhancing the model’s accuracy.
The data preparation stage is the most time-consuming. Although I have seen it account for 90 percent of overall project time, that figure is usually more on the order of 70 percent. However, it can drop as low as 50 percent if data resources are well managed, well integrated and clean from an analytical—not merely a warehousing—perspective. And automating some steps of data preparation may reduce the percentage even farther: Members of a telecommunications marketing team once told me that the team had reduced the average time required to create and deploy promotions from three months to three weeks in just this way."<sup>[1]</sup>

### Import List of Statistical Neighborhood Coordinates as Pandas DataFrame

#### Create Pandas DataFrame from CSV File

In [98]:
# import csv data to pandas DataFrame
hood_coordinates_df = pd.read_csv('denver_neighborhood_coords.csv', header=0, sep=',').sort_values(by=['Neighborhood']).reset_index()
# verify dataframe contents and shape
print('Rows: %i\nColumns: %i' % (hood_coordinates_df.shape[0], hood_coordinates_df.shape[1]))
hood_coordinates_df.head()

Rows: 78
Columns: 4


Unnamed: 0,index,Neighborhood,Latitude,Longitude
0,0,Athmar Park,39.70396,-105.01039
1,1,Auraria,39.74577,-105.01002
2,2,Baker,39.71117,-104.99209
3,3,Barnum,39.71815,-105.03309
4,4,Barnum West,39.71815,-105.0451


### Generate Pandas DataFrame of Marijuana Active Business Licenses Coordinates and License Type

#### Import Marijuana Active Business Licenses Coordinates CSV

In [99]:
# import csv data to pandas DataFrame
marijuana_business_coords = pd.read_csv('denver_cannabis_coords.csv', header=0, sep=',')
# verify dataframe contents and shape
print('Rows: %i\nColumns: %i' % (marijuana_business_coords.shape[0], marijuana_business_coords.shape[1]))

Rows: 351
Columns: 3


In [100]:
marijuana_business_coords.head()

Unnamed: 0,Business File Number,Latitude,Longitude
0,2014-BFN-0003949,39.638685,-105.084801
1,2013-BFN-1068512,39.638685,-105.084801
2,2016-BFN-0004094,39.633705,-105.109778
3,2014-BFN-1071718,39.633705,-105.109778
4,2010-BFN-1045692,39.746715,-104.994229


#### Import Marijuana Active Business Licenses CSV

In [101]:
# import csv data to pandas DataFrame
denver_marijuana_businesses_df = pd.read_csv('denver_marijuana_businesses.csv', header=0, sep=',')
# verify dataframe contents and shape
print('Rows: %i\nColumns: %i' % (denver_marijuana_businesses_df.shape[0], denver_marijuana_businesses_df.shape[1]))

Rows: 351
Columns: 13


#### Merge Marijuana Active Business Licenses DataFrames

In [102]:
merged_marijuana_businesses_df = denver_marijuana_businesses_df.merge(marijuana_business_coords, on='Business File Number').reset_index()
merged_marijuana_businesses_df = merged_marijuana_businesses_df[['Business File Number', 'License Type', 'Entity Name', 'Latitude', 'Longitude']]
print('Rows: %i\nColumns: %i' % (merged_marijuana_businesses_df.shape[0], merged_marijuana_businesses_df.shape[1]))
merged_marijuana_businesses_df.head()

Rows: 351
Columns: 5


Unnamed: 0,Business File Number,License Type,Entity Name,Latitude,Longitude
0,2014-BFN-0003949,Retail Marijuana Store,"RJJ SHERIDAN, LLC",39.638685,-105.084801
1,2013-BFN-1068512,Medical Marijuana Center,"RJJ SHERIDAN, LLC",39.638685,-105.084801
2,2016-BFN-0004094,Retail Marijuana Store,"JVT ENTERPRISES, INC.",39.633705,-105.109778
3,2014-BFN-1071718,Medical Marijuana Center,"JVT ENTERPRISES, INC.",39.633705,-105.109778
4,2010-BFN-1045692,Medical Marijuana Center,"ALTERNATIVE MEDICINE ON THE MALL, LLC",39.746715,-104.994229
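
The merge above relies on every 'Business File Number' appearing exactly once in both tables. A hedged sketch of how that assumption could be checked is shown here, using the `validate` and `indicator` options of `DataFrame.merge`. The two small DataFrames and their file numbers are made up for illustration.

```python
import pandas as pd

# Made-up license rows (illustration only).
licenses_example = pd.DataFrame({
    'Business File Number': ['A-1', 'A-2', 'A-3'],
    'License Type': ['Retail Marijuana Store', 'Medical Marijuana Center',
                     'Retail Marijuana Store'],
})

# Made-up coordinate rows; 'A-3' is deliberately missing.
coords_example = pd.DataFrame({
    'Business File Number': ['A-1', 'A-2'],
    'Latitude': [39.70, 39.71],
    'Longitude': [-105.01, -105.02],
})

# validate raises MergeError if the key is duplicated on either side;
# indicator adds a '_merge' column recording where each row was found.
checked = licenses_example.merge(
    coords_example,
    on='Business File Number',
    how='left',
    validate='one_to_one',
    indicator=True,
)

# Licenses that have no coordinates.
unmatched = checked[checked['_merge'] == 'left_only']
```

With a left join, licenses lacking coordinates stay visible in `unmatched` rather than silently dropping out, which the default inner join would allow.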


### Generate Pandas DataFrame of Neighborhood Data

In [103]:
# import csv data to pandas DataFrame
hood_coordinates_df = pd.read_csv('denver_neighborhood_coords.csv', header=0, sep=',').reset_index()
# verify dataframe contents and shape
print('Rows: %i\nColumns: %i' % (hood_coordinates_df.shape[0], hood_coordinates_df.shape[1]))

Rows: 78
Columns: 4


### Generate Pandas DataFrame of Neighborhoods with Counts for Retail Marijuana Stores and Medical Marijuana Centers
#### Create a Function for Measuring the Distance Between Two Coordinate Points

In [105]:
# import math functions used by the haversine formula
from math import sin, cos, sqrt, atan2, radians

# Create function for calculating distance between two coordinate points
def coordinate_distance(coordinates1, coordinates2):
    """Return the great-circle (haversine) distance in km between two (lat, lon) points in degrees."""
    # approximate radius of earth in km according to wikipedia article titled "Earth Radius"
    R = 6371.0

    # convert degrees to radians
    lat1 = radians(coordinates1[0])
    lon1 = radians(coordinates1[1])
    lat2 = radians(coordinates2[0])
    lon2 = radians(coordinates2[1])

    dlon = lon2 - lon1
    dlat = lat2 - lat1

    # haversine formula
    a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))

    distance = R * c

    return distance

In [118]:
# test function
coordinates1 = merged_marijuana_businesses_df[['Latitude','Longitude']].iloc[0].values
coordinates2 = hood_coordinates_df[['Latitude','Longitude']].iloc[0].values
print(coordinates1)
print(coordinates2)
coordinate_distance(coordinates1, coordinates2)

[  39.63868547 -105.084801  ]
[  39.70396 -105.01039]


9.65621138471498

In [127]:
# count the businesses within within_km of each neighborhood center
def store_type_count(neighborhood_coords, business_cords, within_km):
    location_count = []
    for _, hood in neighborhood_coords.iterrows():
        counter = 0
        # columns are accessed by position: latitude first, longitude second
        coordinates1 = [hood.iloc[0], hood.iloc[1]]
        for _, business in business_cords.iterrows():
            coordinates2 = [business.iloc[0], business.iloc[1]]
            if coordinate_distance(coordinates1, coordinates2) <= within_km:
                counter += 1
        location_count.append(counter)
    return location_count

In [128]:
# create data frame for cannabis venue counts
rms_mmc_counts_df = pd.DataFrame(columns=['Neighborhood', 'Retail Marijuana Store', 'Medical Marijuana Center'])
rms_mmc_counts_df['Neighborhood'] = hood_coordinates_df['Neighborhood']
print(rms_mmc_counts_df.shape)
rms_mmc_counts_df.head()

(78, 3)


Unnamed: 0,Neighborhood,Retail Marijuana Store,Medical Marijuana Center
0,Athmar Park,,
1,Auraria,,
2,Baker,,
3,Barnum,,
4,Barnum West,,


In [129]:
# Generate list of Retail Marijuana Store counts by neighborhood
neighborhood_coords = hood_coordinates_df[['Latitude', 'Longitude']]
business_cords = merged_marijuana_businesses_df[['Latitude','Longitude']][merged_marijuana_businesses_df['License Type'] == 'Retail Marijuana Store']
cannabis_retail_count = store_type_count(neighborhood_coords, business_cords, 1.61)
# Add Retail Marijuana Store count to rms_mmc_counts_df
rms_mmc_counts_df['Retail Marijuana Store'] = cannabis_retail_count
# Generate list of Medical Marijuana Center counts by neighborhood
neighborhood_coords = hood_coordinates_df[['Latitude', 'Longitude']]
business_cords = merged_marijuana_businesses_df[['Latitude','Longitude']][merged_marijuana_businesses_df['License Type'] == 'Medical Marijuana Center']
cannabis_medical_count = store_type_count(neighborhood_coords, business_cords, 1.61)
# Add Medical Marijuana Center count to rms_mmc_counts_df
rms_mmc_counts_df['Medical Marijuana Center'] = cannabis_medical_count
# Add total count to rms_mmc_counts_df
rms_mmc_counts_df['RMS_MMC_Total'] = rms_mmc_counts_df['Retail Marijuana Store'] + rms_mmc_counts_df['Medical Marijuana Center']
# Set dataframe to include only neighborhood and total
rms_mmc_counts_df = rms_mmc_counts_df[['Neighborhood', 'RMS_MMC_Total']]
rms_mmc_counts_df.head()

Unnamed: 0,Neighborhood,RMS_MMC_Total
0,Athmar Park,21
1,Auraria,20
2,Baker,27
3,Barnum,15
4,Barnum West,4


### Generate Pandas DataFrame of Neighborhoods with Venue Data for Non-Dispensaries and Dispensaries

In [140]:
# import csv data to pandas DataFrame
denver_venues = pd.read_csv('denver_neighborhood_venues_top.csv', header=0, sep=',').reset_index()
# verify dataframe contents and shape
print('Rows: %i\nColumns: %i' % (denver_venues.shape[0], denver_venues.shape[1]))
denver_venues.head()

Rows: 5696
Columns: 9


Unnamed: 0,index,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category,Venue Top Category
0,0,Athmar Park,39.70396,-105.01039,Super Star Asian Cuisine,39.710007,-105.013715,Dim Sum Restaurant,Food
1,1,Athmar Park,39.70396,-105.01039,Vinh Xuong Bakery (2),39.710609,-105.015232,Vietnamese Restaurant,Food
2,2,Athmar Park,39.70396,-105.01039,Chain Reaction Brewery,39.699577,-105.001335,Brewery,Nightlife Spot
3,3,Athmar Park,39.70396,-105.01039,The Green Solution - Alameda Ave @ West Denver...,39.71153,-105.01829,Marijuana Dispensary,Shop & Service
4,4,Athmar Park,39.70396,-105.01039,Costco Wholesale,39.708594,-105.01428,Department Store,Shop & Service


In [141]:
# drop rows that are categorized as 'Marijuana Dispensary'
contains_marijuana = denver_venues[denver_venues['Venue Category'].str.contains("Marijuana")].index
denver_venues = denver_venues.drop(contains_marijuana, axis=0).reset_index(drop=True)
denver_venues.head()

Unnamed: 0,index,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category,Venue Top Category
0,0,Athmar Park,39.70396,-105.01039,Super Star Asian Cuisine,39.710007,-105.013715,Dim Sum Restaurant,Food
1,1,Athmar Park,39.70396,-105.01039,Vinh Xuong Bakery (2),39.710609,-105.015232,Vietnamese Restaurant,Food
2,2,Athmar Park,39.70396,-105.01039,Chain Reaction Brewery,39.699577,-105.001335,Brewery,Nightlife Spot
3,4,Athmar Park,39.70396,-105.01039,Costco Wholesale,39.708594,-105.01428,Department Store,Shop & Service
4,5,Athmar Park,39.70396,-105.01039,New Saigon,39.704848,-105.024811,Vietnamese Restaurant,Food


#### Generate Pandas DataFrame with Total for Each Venue Top Category by Neighborhood
The non-dispensary categories are aggregated into a single group in this iteration before clustering; the individual non-dispensary venue categories will be used in a future iteration.

In [142]:
# one hot encoding
denver_venues_onehot = pd.get_dummies(denver_venues[['Venue Top Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
denver_venues_onehot['Neighborhood'] = denver_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [denver_venues_onehot.columns[-1]] + list(denver_venues_onehot.columns[:-1])
denver_venues_onehot = denver_venues_onehot[fixed_columns]
denver_venues_onehot.head()

Unnamed: 0,Neighborhood,Arts & Entertainment,College & University,Food,Nightlife Spot,Outdoors & Recreation,Professional & Other Places,Residence,Shop & Service,Travel & Transport
0,Athmar Park,0,0,1,0,0,0,0,0,0
1,Athmar Park,0,0,1,0,0,0,0,0,0
2,Athmar Park,0,0,0,1,0,0,0,0,0
3,Athmar Park,0,0,0,0,0,0,0,1,0
4,Athmar Park,0,0,1,0,0,0,0,0,0


##### Generate Pandas DataFrame of Neighborhoods with Venue Data for Non-Dispensaries and Dispensaries

In [147]:
denver_venues_grouped = denver_venues_onehot.groupby('Neighborhood').sum().sort_values(by=['Neighborhood']).reset_index()
denver_venues_grouped = denver_venues_grouped.merge(rms_mmc_counts_df, on='Neighborhood')
# rename 'RMS_MMC_Total' to 'Marijuana Dispensary'
denver_venues_grouped.rename(columns={'RMS_MMC_Total': 'Marijuana Dispensary'}, inplace=True)
denver_venues_grouped

Unnamed: 0,Neighborhood,Arts & Entertainment,College & University,Food,Nightlife Spot,Outdoors & Recreation,Professional & Other Places,Residence,Shop & Service,Travel & Transport,Marijuana Dispensary
0,Athmar Park,0,0,35,2,5,2,0,10,0,21
1,Auraria,19,0,43,14,5,1,0,11,7,20
2,Baker,5,0,41,20,3,2,1,23,1,27
3,Barnum,2,0,26,3,2,0,0,12,0,15
4,Barnum West,0,0,12,0,5,0,0,17,1,4
5,Bear Valley,0,0,27,2,2,0,1,21,3,0
6,Belcaro,1,0,42,4,6,0,0,44,2,5
7,Berkeley,4,0,57,11,9,1,0,17,0,7
8,Capitol Hill,11,0,48,17,7,0,0,14,1,28
9,Central Business District,16,0,47,11,7,1,1,5,11,41


#### Normalize the Category values

In [150]:
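# work on a copy so denver_venues_grouped keeps its raw counts
denver_venues_grouped_mean = denver_venues_grouped.copy()
# divide each count by its row total so every neighborhood's shares sum to 1
category_cols = denver_venues_grouped_mean.columns[1:11]
row_totals = denver_venues_grouped_mean[category_cols].sum(axis=1)
denver_venues_grouped_mean[category_cols] = denver_venues_grouped_mean[category_cols].div(row_totals, axis=0)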
denver_venues_grouped_mean

Unnamed: 0,Neighborhood,Arts & Entertainment,College & University,Food,Nightlife Spot,Outdoors & Recreation,Professional & Other Places,Residence,Shop & Service,Travel & Transport,Marijuana Dispensary
0,Athmar Park,0.000000,0.000000,0.466667,0.026667,0.066667,0.026667,0.000000,0.133333,0.000000,0.280000
1,Auraria,0.158333,0.000000,0.358333,0.116667,0.041667,0.008333,0.000000,0.091667,0.058333,0.166667
2,Baker,0.040650,0.000000,0.333333,0.162602,0.024390,0.016260,0.008130,0.186992,0.008130,0.219512
3,Barnum,0.033333,0.000000,0.433333,0.050000,0.033333,0.000000,0.000000,0.200000,0.000000,0.250000
4,Barnum West,0.000000,0.000000,0.307692,0.000000,0.128205,0.000000,0.000000,0.435897,0.025641,0.102564
5,Bear Valley,0.000000,0.000000,0.482143,0.035714,0.035714,0.000000,0.017857,0.375000,0.053571,0.000000
6,Belcaro,0.009615,0.000000,0.403846,0.038462,0.057692,0.000000,0.000000,0.423077,0.019231,0.048077
7,Berkeley,0.037736,0.000000,0.537736,0.103774,0.084906,0.009434,0.000000,0.160377,0.000000,0.066038
8,Capitol Hill,0.087302,0.000000,0.380952,0.134921,0.055556,0.000000,0.000000,0.111111,0.007937,0.222222
9,Central Business District,0.114286,0.000000,0.335714,0.078571,0.050000,0.007143,0.007143,0.035714,0.078571,0.292857


In [151]:
denver_venues_grouped_mean_total = pd.DataFrame(columns=['Neighborhood', 'Marijuana Dispensary', 'Other Venue'])
denver_venues_grouped_mean_total['Neighborhood'] = denver_venues_grouped_mean['Neighborhood']
denver_venues_grouped_mean_total['Marijuana Dispensary'] = denver_venues_grouped_mean['Marijuana Dispensary']
denver_venues_grouped_mean_total['Other Venue'] = denver_venues_grouped_mean.iloc[:,1:10].sum(axis=1)
denver_venues_grouped_mean_total

Unnamed: 0,Neighborhood,Marijuana Dispensary,Other Venue
0,Athmar Park,0.280000,0.720000
1,Auraria,0.166667,0.833333
2,Baker,0.219512,0.780488
3,Barnum,0.250000,0.750000
4,Barnum West,0.102564,0.897436
5,Bear Valley,0.000000,1.000000
6,Belcaro,0.048077,0.951923
7,Berkeley,0.066038,0.933962
8,Capitol Hill,0.222222,0.777778
9,Central Business District,0.292857,0.707143


## Modeling
"Starting with the first version of the prepared data set, data scientists use a training set—historical data in which the outcome of interest is known—to develop predictive or descriptive models using the analytic approach already described. The modeling process is highly iterative."<sup>[1]</sup>

### Cluster Neighborhoods in the City and County of Denver by Venue Type Frequency

In [153]:
from sklearn.cluster import KMeans # k-Means Clustering
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

In [154]:
# set number of clusters
kclusters = 4

denver_grouped_clustering = denver_venues_grouped_mean_total.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(denver_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_

array([1, 3, 1, 1, 3, 0, 0, 0, 1, 1, 0, 3, 0, 0, 3, 1, 1, 3, 2, 0, 0, 0,
       0, 3, 2, 1, 3, 0, 1, 3, 3, 3, 3, 0, 0, 3, 0, 0, 3, 0, 1, 0, 3, 0,
       1, 3, 1, 0, 2, 1, 1, 0, 1, 1, 0, 0, 3, 0, 3, 3, 3, 1, 1, 0, 0, 0,
       1, 3, 0, 0, 3, 0, 0, 3, 0, 3, 0, 3], dtype=int32)
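
The value `kclusters = 4` is set by hand. One common way to sanity-check a choice of k is the silhouette score, which is higher when clusters are compact and well separated. The sketch below is a hedged illustration on synthetic shares, not the Denver data, so its scores say nothing about the real neighborhoods; in the notebook, `features` would be replaced by `denver_grouped_clustering`. The names `features`, `scores`, and `best_k` are assumptions introduced here.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in: a dispensary share and its complement per "neighborhood"
rng = np.random.default_rng(0)
share = np.concatenate([
    rng.normal(0.02, 0.01, 30),
    rng.normal(0.22, 0.02, 30),
    rng.normal(0.45, 0.02, 10),
]).clip(0, 1)
features = np.column_stack([share, 1 - share])

# Fit k-means for several candidate k values and record the silhouette score
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(features)
    scores[k] = silhouette_score(features, labels)

# Candidate k with the highest silhouette score on this synthetic data
best_k = max(scores, key=scores.get)
```

A silhouette comparison is only one heuristic; the final choice of k can still reasonably depend on how many groups are useful for the business question.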

In [155]:
denver_merged = denver_venues_grouped_mean_total.sort_values(by=['Neighborhood'])

# add clustering labels
denver_merged['Cluster Labels'] = kmeans.labels_

# join denver_merged with hood_coordinates_df to add latitude/longitude for each neighborhood
denver_merged = denver_merged.join(hood_coordinates_df.set_index('Neighborhood'), on='Neighborhood')

denver_merged.head() # check the last columns!

Unnamed: 0,Neighborhood,Marijuana Dispensary,Other Venue,Cluster Labels,index,Latitude,Longitude
0,Athmar Park,0.28,0.72,1,0,39.70396,-105.01039
1,Auraria,0.166667,0.833333,3,1,39.74577,-105.01002
2,Baker,0.219512,0.780488,1,2,39.71117,-104.99209
3,Barnum,0.25,0.75,1,3,39.71815,-105.03309
4,Barnum West,0.102564,0.897436,3,4,39.71815,-105.0451


In [156]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=10)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i+x+(i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(denver_merged['Latitude'], denver_merged['Longitude'], denver_merged['Neighborhood'], denver_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

## Results
The model placed Denver's 78 neighborhoods into 4 clusters: Cluster 0, Cluster 1, Cluster 2, and Cluster 3. **The neighborhoods that would face the lowest competition when opening a new Marijuana Dispensary are found in Cluster 0**, where dispensaries make up the smallest share of venues.

#### Clusters

##### Cluster 0

In [157]:
denver_merged[denver_merged['Cluster Labels'] == 0].sort_values(by=['Marijuana Dispensary'])

Unnamed: 0,Neighborhood,Marijuana Dispensary,Other Venue,Cluster Labels,index,Latitude,Longitude
5,Bear Valley,0.0,1.0,0,5,39.66172,-105.06561
54,Skyland,0.0,1.0,0,54,39.75758,-104.9499
41,Lowry Field,0.0,1.0,0,41,39.72251,-104.89154
36,Hilltop,0.0,1.0,0,36,39.71861,-104.9246
34,Harvey Park South,0.0,1.0,0,34,39.6625,-105.04227
33,Harvey Park,0.0,1.0,0,33,39.6625,-105.04227
27,Gateway - Green Valley Ranch,0.0,1.0,0,27,39.78254,-104.75254
22,Denver International Airport,0.0,1.0,0,22,39.85138,-104.68096
39,Kennedy,0.017241,0.982759,0,39,39.65945,-104.85898
10,Chaffee Park,0.033898,0.966102,0,10,39.78741,-105.01756


##### Cluster 1

In [158]:
denver_merged[denver_merged['Cluster Labels'] == 1].sort_values(by=['Marijuana Dispensary'])

Unnamed: 0,Neighborhood,Marijuana Dispensary,Other Venue,Cluster Labels,index,Latitude,Longitude
16,Clayton,0.192308,0.807692,1,16,39.76693,-104.95051
61,Sunnyside,0.197917,0.802083,1,61,39.77483,-105.00644
15,Civic Center,0.209677,0.790323,1,15,39.73526,-104.99096
62,Union Station,0.21875,0.78125,1,62,39.75349,-104.99888
2,Baker,0.219512,0.780488,1,2,39.71117,-104.99209
8,Capitol Hill,0.222222,0.777778,1,9,39.7337,-104.97929
44,Montbello,0.222222,0.777778,1,44,39.79321,-104.83386
53,Ruby Hill,0.225352,0.774648,1,53,39.69106,-105.00874
28,Globeville,0.236842,0.763158,1,28,39.78194,-104.98523
25,Five Points,0.24812,0.75188,1,25,39.7592,-104.9876


##### Cluster 2

In [159]:
denver_merged[denver_merged['Cluster Labels'] == 2].sort_values(by=['Marijuana Dispensary'])

Unnamed: 0,Neighborhood,Marijuana Dispensary,Other Venue,Cluster Labels,index,Latitude,Longitude
48,Northeast Park Hill,0.418605,0.581395,2,48,39.77499,-104.92229
18,College View - South Platte,0.455696,0.544304,2,18,39.67854,-105.00314
24,Elyria Swansea,0.489362,0.510638,2,24,39.78196,-104.9591


##### Cluster 3

In [160]:
denver_merged[denver_merged['Cluster Labels'] == 3].sort_values(by=['Marijuana Dispensary'])

Unnamed: 0,Neighborhood,Marijuana Dispensary,Other Venue,Cluster Labels,index,Latitude,Longitude
73,West Colfax,0.084507,0.915493,3,73,39.74035,-105.04144
45,Montclair,0.090909,0.909091,3,45,39.73166,-104.91337
26,Fort Logan,0.095238,0.904762,3,26,39.64154,-105.04649
29,Goldsmith,0.095238,0.904762,3,29,39.67302,-104.91464
31,Hampden,0.1,0.9,3,31,39.66073,-104.88567
30,Hale,0.101852,0.898148,3,30,39.73271,-104.93042
4,Barnum West,0.102564,0.897436,3,4,39.71815,-105.0451
77,Windsor,0.107143,0.892857,3,77,39.70581,-104.89235
38,Jefferson Park,0.107143,0.892857,3,38,39.75121,-105.02135
23,East Colfax,0.111111,0.888889,3,23,39.74126,-104.894
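
The four tables above can be condensed into one per-cluster summary. The sketch below is a minimal illustration that uses two rows copied from each cluster table rather than the full `denver_merged` frame; `sample` and `summary` are names introduced here. In the notebook, the same `groupby` could be applied to `denver_merged` directly.

```python
import pandas as pd

# Two rows per cluster, copied from the cluster tables above
sample = pd.DataFrame({
    'Neighborhood': ['Bear Valley', 'Kennedy', 'Clayton', 'Five Points',
                     'Northeast Park Hill', 'Elyria Swansea',
                     'West Colfax', 'East Colfax'],
    'Marijuana Dispensary': [0.0, 0.017241, 0.192308, 0.24812,
                             0.418605, 0.489362, 0.084507, 0.111111],
    'Cluster Labels': [0, 0, 1, 1, 2, 2, 3, 3],
})

# Count, minimum, mean, and maximum dispensary share per cluster,
# ordered from the least to the most dispensary-heavy cluster
summary = (sample.groupby('Cluster Labels')['Marijuana Dispensary']
           .agg(['count', 'min', 'mean', 'max'])
           .sort_values('mean'))
```

Sorting by the mean share puts the least saturated cluster first, which makes the ranking of clusters explicit instead of leaving it to be read off four separate tables.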


## Discussion
"The data scientist evaluates the model’s quality and checks whether it addresses the business problem fully and appropriately. Doing so requires the computing of various diagnostic measures—as well as other outputs, such as tables and graphs—using a testing set for a predictive model."<sup>[1]</sup>

### Conclusion

The model used k-means to cluster Denver's 78 neighborhoods by their relative mix of Marijuana Dispensaries and Other Venues. The clustering is primarily useful for visualization, because the data could simply be sorted to identify neighborhoods with a lower relative frequency of Marijuana Dispensaries. Neighborhoods with a lower relative frequency of Marijuana Dispensaries would be better places to open a new Marijuana Dispensary.
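
The sorting alternative mentioned above can be sketched in a few lines. The rows below are copied from the first ten rows of the `denver_venues_grouped_mean_total` output shown earlier; `shares` and `lowest` are names introduced for illustration, and the cutoff of three neighborhoods is an arbitrary assumption.

```python
import pandas as pd

# First ten neighborhoods and their dispensary shares, copied from the output above
shares = pd.DataFrame({
    'Neighborhood': ['Athmar Park', 'Auraria', 'Baker', 'Barnum', 'Barnum West',
                     'Bear Valley', 'Belcaro', 'Berkeley', 'Capitol Hill',
                     'Central Business District'],
    'Marijuana Dispensary': [0.280000, 0.166667, 0.219512, 0.250000, 0.102564,
                             0.000000, 0.048077, 0.066038, 0.222222, 0.292857],
})

# The neighborhoods with the smallest dispensary share, without any clustering
lowest = shares.nsmallest(3, 'Marijuana Dispensary')
```

On these ten rows the shortlist is Bear Valley, Belcaro, and Berkeley, in that order.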

### Improving the Model

The model could be improved by taking into account the zoning regulations that affect whether a new marijuana dispensary can open in a particular location. It could also be improved by taking into account real estate values in each neighborhood, since these raise or lower the cost of entry.

## References
1. Rollins, J. (2015, August 24). Why we need a methodology for data science. Retrieved December 14, 2018, from https://www.ibmbigdatahub.com/blog/why-we-need-methodology-data-science