# David Edwards
## Coursera Applied Data Science Capstone

## Introduction/Business Problem

Cheba Hut (https://chebahut.com/) is a cannabis-themed sandwich shop based in my hometown of Fort Collins, Colorado. It is a franchise with approximately 36 locations nationwide. The company is always looking to expand its market and has a good sense of what makes its franchises work. Its primary indicators of success are:
1. Proximity to College/University
2. Lack of alternate restaurant locations
3. Local cannabis laws

I propose performing a "Neighborhood" search that, instead of concentrating on neighborhoods within a city, concentrates on the neighborhoods around colleges and universities in the US. I propose using the following criteria to determine similarity between universities:
1. Food venues (or sandwich shops) per enrolled student
2. Laxity of cannabis laws
3. Existing Cheba Hut locations

### Data Sources

#### Marijuana Laws By State
https://data.world/sya/marijuana-laws-by-state
#### College and University Campuses
https://hifld-geoplatform.opendata.arcgis.com/datasets/colleges-and-universities-campuses
#### Cheba Hut Locations
https://chebahut.com/locations

In [57]:
import pandas as pd
import numpy as np
import requests # library to handle requests
from pandas import json_normalize # transform JSON into a pandas dataframe (top-level import; the pandas.io.json path is deprecated)
import folium # map rendering library
# import k-means from clustering stage
from sklearn.cluster import KMeans

In [7]:
print("Hello Capstone Project Course")

Hello Capstone Project Course


In [3]:
import config


In [4]:
df = pd.read_csv("Colleges_and_Universities_Campuses.csv")
df.shape

(6005, 27)

In [5]:
df.head()

Unnamed: 0,OBJECTID,UNIQUEID,NAME,ADDRESS,CITY,STATE,ZIP,ZIP4,TELEPHONE,TYPE,...,SOURCE,SOURCEDATE,VAL_METHOD,VAL_DATE,WEBSITE,TOT_ENROLL,TOT_EMP,SHELTERID,SHAPE__Area,SHAPE__Length
0,1,45821004,WEST COAST UNIVERSITY - ONTARIO CAMPUS,"2855 E. GUASTI RD. ONTARIO, CA 91761",ONTARIO,CA,91761,NOT AVAILABLE,(909) 467-6100,3,...,https://westcoastuniversity.edu/campuses/ontar...,1535068800000,IMAGERY/OTHER,1550448000000,https://westcoastuniversity.edu/campuses/ontar...,-999,-999,NOT AVAILABLE,6893.691406,360.534365
1,2,36639501,SUFFOLK COUNTY COMMUNITY COLLEGE - MICHAEL J. ...,CROOKED HILL ROAD,BRENTWOOD,NY,11717,NOT AVAILABLE,NOT AVAILABLE,1,...,https://www.sunysuffolk.edu/about-suffolk/camp...,1554768000000,IMAGERY/OTHER,1554768000000,http://www3.sunysuffolk.edu/About/CampusInfo.asp,9000,-999,NOT AVAILABLE,831833.769531,3408.024412
2,3,45821003,WEST COAST UNIVERSITY - LA PALMA COUNTY CAMPUS,2411 W. LA PALMA AVE.,ANAHEIM,CA,92801,NOT AVAILABLE,(714) 876-6082,3,...,https://westcoastuniversity.edu/academics/libr...,1550448000000,IMAGERY/OTHER,1550448000000,https://westcoastuniversity.edu/academics/libr...,-999,-999,NOT AVAILABLE,6032.871094,310.813123
3,4,45821002,WEST COAST UNIVERSITY - MIAMI CAMPUS,9250 NW 36TH STREET,DORAL,FL,33178,NOT AVAILABLE,(786) 501-7052,3,...,https://westcoastuniversity.edu/campuses/miami...,1535500800000,IMAGERY/OTHER,1550448000000,https://westcoastuniversity.edu/academics/libr...,-999,-999,NOT AVAILABLE,18280.9375,540.831186
4,5,45821001,WEST COAST UNIVERSITY - LOS ANGELES CENTER FOR...,590 NORTH VERMONT AVENUE,LOS ANGELES,CA,90004,NOT AVAILABLE,(323) 473-5672,3,...,https://westcoastuniversity.edu/campuses/los-a...,1550448000000,IMAGERY/OTHER,1550448000000,https://westcoastuniversity.edu/academics/libr...,-999,-999,NOT AVAILABLE,8609.371094,447.077456


#### We're interested in the names and locations, not so much the rest of this info

In [6]:
df.drop(columns=['OBJECTID', 'ZIP4', 'TELEPHONE', 'SOURCE', 'SOURCEDATE', 'VAL_DATE', 'WEBSITE', 'COUNTY', 'COUNTYFIPS', 'COUNTRY', 'NAICS_CODE', 'NAICS_DESC', 'VAL_METHOD', 'SHELTERID', 'SHAPE__Area', 'SHAPE__Length'], inplace=True)

I work at Colorado State University, so I wanted to see the pertinent data for my location and get a better idea of what TOT_ENROLL and POPULATION mean. A TOT_ENROLL of 33083 and a POPULATION of 40766 tell me that the former is the number of students and the latter is students plus employees (TOT_EMP is 7683). I'll use total POPULATION, as employees need to eat sandwiches also.

In [7]:
df[(df['TOT_ENROLL'] != -999) & (df['STATE']=='CO')].sort_values('TOT_ENROLL', ascending=False)

Unnamed: 0,UNIQUEID,NAME,ADDRESS,CITY,STATE,ZIP,TYPE,STATUS,POPULATION,TOT_ENROLL,TOT_EMP
2973,126614,UNIVERSITY OF COLORADO BOULDER,REGENT DRIVE AT BROADWAY,BOULDER,CO,80309,1,A,44498,35338,9160
2979,126818,COLORADO STATE UNIVERSITY-FORT COLLINS,102 ADMINISTRATION BUILDING,FORT COLLINS,CO,80523,1,A,40766,33083,7683
2980,126827,COLORADO TECHNICAL UNIVERSITY-COLORADO SPRINGS,4435 N CHESTNUT STREET,COLORADO SPRINGS,CO,80907,3,A,27508,25517,1991
2971,126562,UNIVERSITY OF COLORADO DENVER/ANSCHUTZ MEDICAL...,"1380 LAWRENCE STREET, LAWRENCE STREET CENTER, ...",DENVER,CO,80217,1,A,36608,24839,11769
5487,127565,METROPOLITAN STATE UNIVERSITY OF DENVER,SPEER BLVD AND COLFAX AVE,DENVER,CO,80217,1,A,22515,20304,2211
...,...,...,...,...,...,...,...,...,...,...,...
2255,466189,NATIONAL AMERICAN UNIVERSITY-COLORADO SPRINGS ...,"1079 SPACE CENTER DRIVE, SUITE 140",COLORADO SPRINGS,CO,80915,3,A,171,152,19
2652,443632,COLORADO MEDIA SCHOOL,404 SOUTH UPHAM ST.,LAKEWOOD,CO,80226,3,A,149,123,26
2759,461953,COLORADO ACADEMY OF VETERINARY TECHNOLOGY,2766 JANITELL ROAD,COLORADO SPRINGS,CO,80906,3,A,127,109,18
2426,126164,THE SALON PROFESSIONAL ACADEMY-GRAND JUNCTION,432 NORTH AVENUE,GRAND JUNCTION,CO,81501,3,A,76,62,14
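
The reading of the columns above can be checked with a little arithmetic. A minimal sketch, assuming (this is inferred from the displayed rows, not taken from the dataset documentation) that POPULATION is simply TOT_ENROLL plus TOT_EMP; the tuples are copied from the output above:

```python
# (NAME, POPULATION, TOT_ENROLL, TOT_EMP) copied from the Colorado rows above
rows = [
    ("UNIVERSITY OF COLORADO BOULDER", 44498, 35338, 9160),
    ("COLORADO STATE UNIVERSITY-FORT COLLINS", 40766, 33083, 7683),
    ("COLORADO TECHNICAL UNIVERSITY-COLORADO SPRINGS", 27508, 25517, 1991),
    ("METROPOLITAN STATE UNIVERSITY OF DENVER", 22515, 20304, 2211),
]

# True where students + employees matches the reported POPULATION
checks = [enroll + emp == population for _, population, enroll, emp in rows]

for (name, *_), ok in zip(rows, checks):
    print(name, ok)
```

Every sampled row agrees, which supports using POPULATION as the combined head count.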


### Now I know we can get rid of any POPULATION values of -999. I will also eliminate POPULATION values below 1000
The dataset notes that -999 means unknown, and I want to eliminate very small schools. A single filter on POPULATION >= 1000 covers both cases.

In [12]:
df = df[df['POPULATION'] >=1000]
df.shape

(2930, 11)

#### We need to geocode all of these addresses, and none of our options are very good, so we're going to try the Census geocoder.
https://geocoding.geo.census.gov/geocoder/Geocoding_Services_API.html

In [13]:
df[['UNIQUEID', 'ADDRESS', 'CITY', 'STATE', 'ZIP']].set_index('UNIQUEID').to_csv("univ_to_geocode.csv")

In [113]:
!curl --form addressFile=@univ_to_geocode.csv --form benchmark=9 https://geocoding.geo.census.gov/geocoder/locations/addressbatch --output geocodeduniversities.csv

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  131k  100   556  100  130k    726   170k --:--:-- --:--:-- --:--:--  171k


In [14]:
geo = pd.read_csv("geocodeduniversities.csv")
geo.head()

Unnamed: 0,<p>While attempting to geocode your batch input,an error occurred validating and processing the parameters that were provided. </p> <p>Please validate the benchmark,vintage (if this is a geographies batch geocode request),and addressFile parameter values that are being used and retry your batch geocode request. </p> <p>More information and documentation (available in HTML and PDF formats) about the Census Geocoder and how to use it can be found here: <a href='https://geocoding.geo.census.gov/geocoder/'>https://geocoding.geo.census.gov/</a></p>
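
One thing worth ruling out before giving up on the Census geocoder is the input file format. As far as I know, the batch endpoint expects a header-less CSV with five fields per record (unique ID, street address, city, state, ZIP), whereas `to_csv` above writes a header row by default. A minimal sketch using only the standard library; the helper `write_census_batch`, the sample rows, and the file name `univ_to_geocode_noheader.csv` are illustrative, and this is not confirmed to be the cause of the error above:

```python
import csv
import os
import tempfile

def write_census_batch(records, path):
    """Write (id, street, city, state, zip) tuples without a header row."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        for unique_id, street, city, state, zip_code in records:
            writer.writerow([unique_id, street, city, state, zip_code])

# Two sample records taken from the Colorado rows earlier in the notebook
sample = [
    (126818, "102 ADMINISTRATION BUILDING", "FORT COLLINS", "CO", "80523"),
    (126614, "REGENT DRIVE AT BROADWAY", "BOULDER", "CO", "80309"),
]
batch_path = os.path.join(tempfile.gettempdir(), "univ_to_geocode_noheader.csv")
write_census_batch(sample, batch_path)

# Confirm the first line is a data record, not column names
with open(batch_path, newline="") as f:
    first_line = f.readline().strip()
print(first_line)
```

With pandas, passing `header=False` to `to_csv` should have the same effect.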


### Well. The Census geocoder doesn't work for our data, so we'll sign up for a Google Maps API key and hope to keep our queries low enough that I don't get billed

In [16]:
import googlemaps

In [17]:
gmaps = googlemaps.Client(key=config.GMAPSKEY)
df["FullAddress"] = df["NAME"] + " " + df["ADDRESS"] + ", " + df["CITY"] + ", " + df["STATE"] + " " + df["ZIP"] 

In [18]:
df["FullAddress"].iloc[0]

'SUFFOLK COUNTY COMMUNITY COLLEGE - MICHAEL J. GRANT CAMPUS CROOKED HILL ROAD, BRENTWOOD, NY 11717'

In [19]:
geocode_result = gmaps.geocode(df["FullAddress"].iloc[0])
geocode_result


[{'access_points': [],
  'address_components': [{'long_name': 'Crooked Hill Road',
    'short_name': 'Crooked Hill Rd',
    'types': ['route']},
   {'long_name': 'Brentwood',
    'short_name': 'Brentwood',
    'types': ['locality', 'political']},
   {'long_name': 'Islip',
    'short_name': 'Islip',
    'types': ['administrative_area_level_3', 'political']},
   {'long_name': 'Suffolk County',
    'short_name': 'Suffolk County',
    'types': ['administrative_area_level_2', 'political']},
   {'long_name': 'New York',
    'short_name': 'NY',
    'types': ['administrative_area_level_1', 'political']},
   {'long_name': 'United States',
    'short_name': 'US',
    'types': ['country', 'political']},
   {'long_name': '11717', 'short_name': '11717', 'types': ['postal_code']}],
  'formatted_address': 'Crooked Hill Rd, Brentwood, NY 11717, USA',
  'geometry': {'location': {'lat': 40.7960983, 'lng': -73.27396490000001},
   'location_type': 'GEOMETRIC_CENTER',
   'viewport': {'northeast': {'lat': 4

In [21]:
flat = json_normalize(geocode_result)
location = geocode_result[0]['geometry']['location']
print (location["lat"], ", ", location["lng"])

# for item in geocode_result[0]['geometry']['location']:
#     print(item["lat"], " ", item["lng"])
# "{}, {}".format(flat["geometry.location.lat"], flat["geometry.location.lng"])

40.7960983 ,  -73.27396490000001


In [50]:
def getSchoolLatLng(address):
    georesult = gmaps.geocode(address)
    if georesult:
        return georesult[0]['geometry']['location']["lat"], georesult[0]['geometry']['location']['lng']
    else:
        return 0,0
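
Returning `0, 0` for a failed lookup works with the filter applied later, but (0, 0) is also a real coordinate. A small sketch of an alternative that returns `None` instead, assuming the response shape shown in the geocode output above; `extract_lat_lng` is a hypothetical helper and the sample response is trimmed from that output:

```python
def extract_lat_lng(georesult):
    """Return (lat, lng) from a geocode response list, or None if it is empty."""
    if not georesult:
        return None
    location = georesult[0]["geometry"]["location"]
    return location["lat"], location["lng"]

# Trimmed copy of the response structure printed earlier
sample_response = [
    {"geometry": {"location": {"lat": 40.7960983, "lng": -73.27396490000001}}}
]

print(extract_lat_lng(sample_response))
print(extract_lat_lng([]))
```

Rows with a `None` result could then be dropped explicitly instead of comparing the latitude to 0.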



Here we geocode all of the universities using my paid API key.  Save to CSV so we don't do that again

In [51]:
# df['Latitude'], df['Longitude'] = zip(*df["FullAddress"].map(getSchoolLatLng))
# df.to_csv("geocoded_universities.csv")
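
Because each geocoding call costs money, it can help to make the cache check explicit rather than commenting code in and out. A minimal sketch using only the standard library; `load_or_compute` is a hypothetical helper, and `fake_geocode` stands in for the real API call so the example runs offline:

```python
import csv
import os
import tempfile

def load_or_compute(path, addresses, geocode):
    """Read cached rows if the file exists; otherwise geocode and save them."""
    if os.path.exists(path):
        with open(path, newline="") as f:
            return [(a, float(lat), float(lng)) for a, lat, lng in csv.reader(f)]
    rows = [(a, *geocode(a)) for a in addresses]
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(rows)
    return rows

calls = []  # records every address the stand-in geocoder is asked about

def fake_geocode(address):
    calls.append(address)
    return 40.5, -105.0

cache_path = os.path.join(tempfile.mkdtemp(), "geocode_cache.csv")
first = load_or_compute(cache_path, ["A", "B"], fake_geocode)   # calls the geocoder
second = load_or_compute(cache_path, ["A", "B"], fake_geocode)  # served from the file
print(len(calls))
```

The second call never touches the geocoder, which is the behavior the commented-out cell is protecting by hand.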

We don't need the full address anymore

In [53]:
df.drop(columns='FullAddress', inplace=True)

We can remove the universities where we couldn't find a location
Also, then we can save it so we don't need to call the API again.

In [59]:
df = df[df['Latitude']!=0]
df.to_csv("geocoded_universities.csv")

In [8]:
df = pd.read_csv("geocoded_universities.csv")

In [9]:
df.head()

Unnamed: 0.1,Unnamed: 0,UNIQUEID,NAME,ADDRESS,CITY,STATE,ZIP,TYPE,STATUS,POPULATION,TOT_ENROLL,TOT_EMP,Latitude,Longitude
0,1,36639501,SUFFOLK COUNTY COMMUNITY COLLEGE - MICHAEL J. ...,CROOKED HILL ROAD,BRENTWOOD,NY,11717,1,A,9000,9000,-999,40.796098,-73.273965
1,121,11891201,MIRACOSTA COLLEGE - SAN ELIJO CAMPUS,3333 MANCHESTER AVENUE,ENCINITAS,CA,92007,1,A,4000,4000,-999,33.017382,-117.258163
2,130,23826303,MADISON AREA TECHNICAL COLLEGE - WEST,302 S GAMMON ROAD,MADISON,WI,53717,1,A,3572,3572,-999,43.0769,-89.526596
3,331,22148506,SOUTHWEST TENNESSEE COMMUNITY COLLEGE - WHITEH...,"3035 DIRECTORS ROW, BUILDING 6",MEMPHIS,TN,38131,1,A,1198,872,326,35.066853,-89.993296
4,415,21988802,COLUMBIA STATE COMMUNITY COLLEGE - LAWRENCE CO...,1620 SPRINGER RD,LAWRENCEBURG,TN,38464,1,A,1084,715,369,35.256213,-87.316339


Let's look at our university map

In [14]:
univ_map = folium.Map(location=[39.829038, -98.579201], zoom_start=4)

# add markers to map
for lat, lng, name in zip(df['Latitude'], df['Longitude'], df['NAME']):
    label = name
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(univ_map)  
    
univ_map

### We're only interested in sandwich shops which coincide with bars.
Cheba Hut has a full-service bar inside, which opens it to a different clientele. Its competitors aren't shops like Subway so much as fast-casual restaurants like Chipotle and Red Robin.

Sandwich shop is categoryid = 4bf58dd8d48988d1c5941735

Bar is categoryid = 4bf58dd8d48988d116941735

In [19]:
url = 'https://api.foursquare.com/v2/venues/explore?categoryid=4bf58dd8d48988d1c5941735,4bf58dd8d48988d116941735&client_id={}&client_secret={}&ll={}&v={}&radius={}&limit={}'.format(config.CLIENT_ID, config.CLIENT_SECRET, '40.572760, -105.086184', config.VERSION, 3210, 1000)
results = requests.get(url).json()
venues = results['response']['groups'][0]['items']
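
Long query strings are easier to read and safer to edit when built from a dictionary. A sketch using `urllib.parse.urlencode` from the standard library, with the same parameter names as the URL above; `build_explore_url` is a hypothetical helper, and the credential and version values are placeholders rather than real ones:

```python
from urllib.parse import urlencode, urlparse, parse_qs

BASE_URL = "https://api.foursquare.com/v2/venues/explore"

def build_explore_url(lat, lng, client_id, client_secret, version,
                      radius=3210, limit=1000):
    """Assemble the explore URL from named parameters."""
    params = {
        # parameter name and category IDs copied from the notebook's URL
        "categoryid": "4bf58dd8d48988d1c5941735,4bf58dd8d48988d116941735",
        "client_id": client_id,
        "client_secret": client_secret,
        "ll": "{},{}".format(lat, lng),
        "v": version,
        "radius": radius,
        "limit": limit,
    }
    return BASE_URL + "?" + urlencode(params)

# Placeholder credentials and version string, for illustration only
url = build_explore_url(40.572760, -105.086184, "ID", "SECRET", "20200101")

# Parse the query string back to confirm what was encoded
query = parse_qs(urlparse(url).query)
print(query["ll"], query["radius"])
```

`requests.get(BASE_URL, params=params)` performs the same encoding, so the dictionary can also be passed straight to `requests`.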


In [21]:
results
results['response']['totalResults']

214

In [23]:
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]


# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

# nearby_venues[nearby_venues['name'].str.contains('Sub')]
all_venues = json_normalize(venues) # flatten JSON


In [24]:
all_venues.loc[:,["venue.name", "venue.location.lat", "venue.location.lng"]]

Unnamed: 0,venue.name,venue.location.lat,venue.location.lng
0,Ramskeller Pub & Grub,40.574716,-105.085004
1,The Colorado Room,40.578542,-105.076975
2,CSU Bookstore,40.575450,-105.084572
3,Krazy Karl's Pizza,40.575048,-105.097184
4,Cheba Hut Toasted Subs,40.578200,-105.076669
...,...,...,...
95,Jim's Wings,40.574418,-105.097337
96,Hops & Berries,40.586363,-105.075765
97,Insomnia Cookies,40.578318,-105.076799
98,The Gardens on Spring Creek,40.561164,-105.085234


In [44]:
# function to get the number of venues Foursquare reports near a school's lat/lng
def sandwichStoreCount(row):
    url = 'https://api.foursquare.com/v2/venues/explore?categoryid=4bf58dd8d48988d1c5941735,4bf58dd8d48988d116941735&client_id={}&client_secret={}&ll={},{}&v={}&radius={}&limit={}'.format(config.CLIENT_ID, config.CLIENT_SECRET, row['Latitude'], row['Longitude'], config.VERSION, 3210, 1000)
    results = requests.get(url).json()
    try:
        return results['response']['totalResults']
    except (KeyError, TypeError):
        # the response carried no count (for example, an error payload)
        return 0
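
A return value of 0 from the function above can mean either "no venues nearby" or "the request failed". A sketch of one way to keep the two apart, assuming the v2 response reports its status in `meta.code` (200 on success); `count_from_payload` is a hypothetical helper and the payloads are made up for illustration:

```python
def count_from_payload(payload):
    """Return totalResults for a successful response, or None on failure."""
    if payload.get("meta", {}).get("code") != 200:
        return None
    return payload.get("response", {}).get("totalResults", 0)

# Illustrative payloads: a normal result, a genuine zero, and a failed request
ok = {"meta": {"code": 200}, "response": {"totalResults": 214}}
empty = {"meta": {"code": 200}, "response": {"totalResults": 0}}
failed = {"meta": {"code": 429}, "response": {}}

print(count_from_payload(ok), count_from_payload(empty), count_from_payload(failed))
```

Schools with a `None` count could then be retried later instead of being treated as having no venues.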



In [43]:
df.head().apply (lambda row: sandwichStoreCount(row), axis=1)

0    0
1    0
2    0
3    0
4    0
dtype: int64

In [45]:
df['SandwichStoreCount'] = df.apply (lambda row: sandwichStoreCount(row), axis=1)

In [46]:
df.to_csv("univandsandwich.csv")

In [38]:
df = pd.read_csv("univandsandwich.csv")

In [39]:
df.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,UNIQUEID,NAME,ADDRESS,CITY,STATE,ZIP,TYPE,STATUS,POPULATION,TOT_ENROLL,TOT_EMP,Latitude,Longitude,SandwichStoreCount
0,0,1,36639501,SUFFOLK COUNTY COMMUNITY COLLEGE - MICHAEL J. ...,CROOKED HILL ROAD,BRENTWOOD,NY,11717,1,A,9000,9000,-999,40.796098,-73.273965,92
1,1,121,11891201,MIRACOSTA COLLEGE - SAN ELIJO CAMPUS,3333 MANCHESTER AVENUE,ENCINITAS,CA,92007,1,A,4000,4000,-999,33.017382,-117.258163,158
2,2,130,23826303,MADISON AREA TECHNICAL COLLEGE - WEST,302 S GAMMON ROAD,MADISON,WI,53717,1,A,3572,3572,-999,43.0769,-89.526596,158
3,3,331,22148506,SOUTHWEST TENNESSEE COMMUNITY COLLEGE - WHITEH...,"3035 DIRECTORS ROW, BUILDING 6",MEMPHIS,TN,38131,1,A,1198,872,326,35.066853,-89.993296,78
4,4,415,21988802,COLUMBIA STATE COMMUNITY COLLEGE - LAWRENCE CO...,1620 SPRINGER RD,LAWRENCEBURG,TN,38464,1,A,1084,715,369,35.256213,-87.316339,45


In [40]:
df = df.drop(["Unnamed: 0","Unnamed: 0.1", "TYPE", "STATUS", "TOT_ENROLL", "TOT_EMP"],axis=1)

In [41]:
df["StoresPerPerson"] = df["SandwichStoreCount"]/df["POPULATION"]
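
As a quick check of the ratio, the first two rows can be recomputed by hand. The per-1,000 figure is an extra, easier-to-read rescaling and is not a column in the notebook:

```python
# (name, SandwichStoreCount, POPULATION) copied from the rows shown above
rows = [
    ("SUFFOLK COUNTY COMMUNITY COLLEGE - MICHAEL J. GRANT CAMPUS", 92, 9000),
    ("MIRACOSTA COLLEGE - SAN ELIJO CAMPUS", 158, 4000),
]

ratios = {}
for name, stores, population in rows:
    per_person = stores / population
    ratios[name] = per_person
    print(name, round(per_person, 6), "per person,",
          round(1000 * per_person, 2), "per 1,000 people")
```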

In [42]:
df.head()

Unnamed: 0,UNIQUEID,NAME,ADDRESS,CITY,STATE,ZIP,POPULATION,Latitude,Longitude,SandwichStoreCount,StoresPerPerson
0,36639501,SUFFOLK COUNTY COMMUNITY COLLEGE - MICHAEL J. ...,CROOKED HILL ROAD,BRENTWOOD,NY,11717,9000,40.796098,-73.273965,92,0.010222
1,11891201,MIRACOSTA COLLEGE - SAN ELIJO CAMPUS,3333 MANCHESTER AVENUE,ENCINITAS,CA,92007,4000,33.017382,-117.258163,158,0.0395
2,23826303,MADISON AREA TECHNICAL COLLEGE - WEST,302 S GAMMON ROAD,MADISON,WI,53717,3572,43.0769,-89.526596,158,0.044233
3,22148506,SOUTHWEST TENNESSEE COMMUNITY COLLEGE - WHITEH...,"3035 DIRECTORS ROW, BUILDING 6",MEMPHIS,TN,38131,1198,35.066853,-89.993296,78,0.065109
4,21988802,COLUMBIA STATE COMMUNITY COLLEGE - LAWRENCE CO...,1620 SPRINGER RD,LAWRENCEBURG,TN,38464,1084,35.256213,-87.316339,45,0.041513


In [9]:
mmjdata = pd.read_csv("mmj-data.csv")

In [10]:
mmjdata.head()

Unnamed: 0,State,Pop,legalWeedStatus,medicinalWeedStatus,decriminalizedWeedStatus,state
0,Alabama,4908621,Fully Illegal,False,False,Alabama
1,Alaska,734002,Fully Legal,True,True,Alaska
2,Arizona,7378494,Mixed,True,False,Arizona
3,Arkansas,3038999,Mixed,True,False,Arkansas
4,California,39937489,Fully Legal,True,True,California


In [11]:
us_state_abbrev = {
    'Alabama': 'AL',
    'Alaska': 'AK',
    'American Samoa': 'AS',
    'Arizona': 'AZ',
    'Arkansas': 'AR',
    'California': 'CA',
    'Colorado': 'CO',
    'Connecticut': 'CT',
    'Delaware': 'DE',
    'District of Columbia': 'DC',
    'Florida': 'FL',
    'Georgia': 'GA',
    'Guam': 'GU',
    'Hawaii': 'HI',
    'Idaho': 'ID',
    'Illinois': 'IL',
    'Indiana': 'IN',
    'Iowa': 'IA',
    'Kansas': 'KS',
    'Kentucky': 'KY',
    'Louisiana': 'LA',
    'Maine': 'ME',
    'Maryland': 'MD',
    'Massachusetts': 'MA',
    'Michigan': 'MI',
    'Minnesota': 'MN',
    'Mississippi': 'MS',
    'Missouri': 'MO',
    'Montana': 'MT',
    'Nebraska': 'NE',
    'Nevada': 'NV',
    'New Hampshire': 'NH',
    'New Jersey': 'NJ',
    'New Mexico': 'NM',
    'New York': 'NY',
    'North Carolina': 'NC',
    'North Dakota': 'ND',
    'Northern Mariana Islands':'MP',
    'Ohio': 'OH',
    'Oklahoma': 'OK',
    'Oregon': 'OR',
    'Pennsylvania': 'PA',
    'Puerto Rico': 'PR',
    'Rhode Island': 'RI',
    'South Carolina': 'SC',
    'South Dakota': 'SD',
    'Tennessee': 'TN',
    'Texas': 'TX',
    'Utah': 'UT',
    'Vermont': 'VT',
    'Virgin Islands': 'VI',
    'Virginia': 'VA',
    'Washington': 'WA',
    'West Virginia': 'WV',
    'Wisconsin': 'WI',
    'Wyoming': 'WY'
}


In [14]:
mmjdata["ST"] =  mmjdata.apply (lambda row: us_state_abbrev[row["State"]], axis=1)

In [17]:
mmjdata = mmjdata.drop(['State', 'Pop', 'state'], axis=1)

In [18]:
mmjdata.head()

Unnamed: 0,legalWeedStatus,medicinalWeedStatus,decriminalizedWeedStatus,ST
0,Fully Illegal,False,False,AL
1,Fully Legal,True,True,AK
2,Mixed,True,False,AZ
3,Mixed,True,False,AR
4,Fully Legal,True,True,CA


In [21]:
mmjencoded = pd.get_dummies(mmjdata, columns=["legalWeedStatus", "medicinalWeedStatus", "decriminalizedWeedStatus"])
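
To make clear what `get_dummies` produces, here is a plain-Python sketch of one-hot encoding with the same `prefix_value` naming. The helper `one_hot` is hypothetical and is not how pandas implements it. Note that a True/False column yields two columns that always mirror each other:

```python
def one_hot(values, prefix):
    """Return one dict per value with a 0/1 entry for every category."""
    categories = sorted(set(values), key=str)
    return [
        {"{}_{}".format(prefix, c): int(v == c) for c in categories}
        for v in values
    ]

legal = one_hot(["Fully Illegal", "Fully Legal", "Mixed"], "legalWeedStatus")
medicinal = one_hot([False, True, True], "medicinalWeedStatus")

print(legal[0])
print(medicinal[0])
```

Since the `_False` and `_True` columns carry the same information, `pd.get_dummies(..., drop_first=True)` is one way to keep only one of them.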

Merge this dataset with the main one

In [43]:
df = df.merge(mmjencoded, left_on='STATE', right_on='ST').drop(columns='ST')

In [44]:
df.shape


(2864, 18)
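
`merge` defaults to an inner join, so any school whose `STATE` code has no row in the law table is dropped without warning. A sketch of how to list such codes before merging, in plain Python with made-up sample values; `unmatched_keys` is a hypothetical helper:

```python
def unmatched_keys(left_keys, right_keys):
    """Return the sorted keys that appear on the left but not on the right."""
    return sorted(set(left_keys) - set(right_keys))

# Made-up sample values, for illustration only
school_states = ["CO", "NY", "PR", "CO", "GU"]
law_states = ["CO", "NY", "CA"]

missing = unmatched_keys(school_states, law_states)
print(missing)
```

In pandas, `merge(..., how='left', indicator=True)` exposes the same information through the `_merge` column.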

Scale the population

In [55]:
# Scale the population
from sklearn import  preprocessing 
sscaler = preprocessing.StandardScaler()
X = np.array(df["POPULATION"]).reshape(-1,1)
df["ScaledPopulation"] = sscaler.fit_transform(X).flatten()
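
`StandardScaler` subtracts the mean and divides by the population standard deviation (ddof=0), while `describe()` reports the sample standard deviation (ddof=1). A sketch of the same calculation with the standard library, followed by a check of one scaled value against the summary statistics reported later in this notebook (mean 8208.84602, std 11150.256153, count 2864); `standardize` is a hypothetical helper:

```python
import math
from statistics import mean, pstdev

def standardize(values):
    """Z-score each value using the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

z = standardize([1000, 2000, 3000])
print(z)

# Reproduce one scaled value from the notebook's reported statistics
n = 2864
pop_mean = 8208.84602
sample_std = 11150.256153                             # from describe(), ddof=1
population_std = sample_std * math.sqrt((n - 1) / n)  # convert to ddof=0
z_suffolk = (9000 - pop_mean) / population_std        # POPULATION of 9000
print(z_suffolk)
```

The last value should agree with the `ScaledPopulation` shown for the 9000-person campus to about six decimal places.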

In [52]:
df.head()

Unnamed: 0,UNIQUEID,NAME,ADDRESS,CITY,STATE,ZIP,POPULATION,Latitude,Longitude,SandwichStoreCount,StoresPerPerson,legalWeedStatus_Fully Illegal,legalWeedStatus_Fully Legal,legalWeedStatus_Mixed,medicinalWeedStatus_False,medicinalWeedStatus_True,decriminalizedWeedStatus_False,decriminalizedWeedStatus_True,ScaledPopulation
0,36639501,SUFFOLK COUNTY COMMUNITY COLLEGE - MICHAEL J. ...,CROOKED HILL ROAD,BRENTWOOD,NY,11717,9000,40.796098,-73.273965,92,0.010222,0,0,1,0,1,1,0,0.070966
1,189228,BERKELEY COLLEGE-NEW YORK,3 EAST 43 STREET,NEW YORK,NY,10017,4155,40.75392,-73.979502,233,0.056077,0,0,1,0,1,1,0,-0.363629
2,196565,TOMPKINS CORTLAND COMMUNITY COLLEGE,170 NORTH ST,DRYDEN,NY,13053,3094,42.502126,-76.287671,11,0.003555,0,0,1,0,1,1,0,-0.4588
3,196592,TOURO COLLEGE,500 7TH AVENUE,NEW YORK,NY,10018,14505,40.753159,-73.98936,235,0.016201,0,0,1,0,1,1,0,0.564763
4,188526,ALBANY COLLEGE OF PHARMACY AND HEALTH SCIENCES,106 NEW SCOTLAND AVENUE,ALBANY,NY,12208,1693,42.652196,-73.778839,98,0.057885,0,0,1,0,1,1,0,-0.584469


In [53]:
df.columns

Index(['UNIQUEID', 'NAME', 'ADDRESS', 'CITY', 'STATE', 'ZIP', 'POPULATION',
       'Latitude', 'Longitude', 'SandwichStoreCount', 'StoresPerPerson',
       'legalWeedStatus_Fully Illegal', 'legalWeedStatus_Fully Legal',
       'legalWeedStatus_Mixed', 'medicinalWeedStatus_False',
       'medicinalWeedStatus_True', 'decriminalizedWeedStatus_False',
       'decriminalizedWeedStatus_True', 'ScaledPopulation'],
      dtype='object')

In [54]:
featureset = df[['StoresPerPerson',
       'legalWeedStatus_Fully Illegal', 'legalWeedStatus_Fully Legal',
       'legalWeedStatus_Mixed', 'medicinalWeedStatus_False',
       'medicinalWeedStatus_True', 'decriminalizedWeedStatus_False',
       'decriminalizedWeedStatus_True', 'ScaledPopulation']]
featureset.head()

Unnamed: 0,StoresPerPerson,legalWeedStatus_Fully Illegal,legalWeedStatus_Fully Legal,legalWeedStatus_Mixed,medicinalWeedStatus_False,medicinalWeedStatus_True,decriminalizedWeedStatus_False,decriminalizedWeedStatus_True,ScaledPopulation
0,0.010222,0,0,1,0,1,1,0,0.070966
1,0.056077,0,0,1,0,1,1,0,-0.363629
2,0.003555,0,0,1,0,1,1,0,-0.4588
3,0.016201,0,0,1,0,1,1,0,0.564763
4,0.057885,0,0,1,0,1,1,0,-0.584469


In [61]:
# set number of clusters
kclusters = 5

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(featureset)

# check cluster labels generated for each row in the dataframe
df['Labels'] = kmeans.labels_
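
k-means groups rows by Euclidean distance, so features on larger numeric scales carry more weight. A sketch comparing how much two features contribute to the squared distance between rows 0 and 3 of the feature set shown above (values copied from that output). It illustrates scale only and is not part of the original analysis:

```python
# Two features for rows 0 and 3 of the feature set printed above
row_a = {"StoresPerPerson": 0.010222, "ScaledPopulation": 0.070966}
row_b = {"StoresPerPerson": 0.016201, "ScaledPopulation": 0.564763}

# Each feature's share of the squared Euclidean distance between the rows
contributions = {
    name: (row_a[name] - row_b[name]) ** 2 for name in row_a
}
for name, value in contributions.items():
    print(name, value)
```

Here the population term is several thousand times larger than the stores-per-person term, which suggests that scaling `StoresPerPerson` as well might change the clusters.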

In [70]:
df.shape

(2864, 20)

In [74]:
df.drop(columns=['UNIQUEID', 'ZIP', 'Latitude', 'Longitude', 'Labels']).describe()

Unnamed: 0,POPULATION,SandwichStoreCount,StoresPerPerson,legalWeedStatus_Fully Illegal,legalWeedStatus_Fully Legal,legalWeedStatus_Mixed,medicinalWeedStatus_False,medicinalWeedStatus_True,decriminalizedWeedStatus_False,decriminalizedWeedStatus_True,ScaledPopulation
count,2864.0,2864.0,2864.0,2864.0,2864.0,2864.0,2864.0,2864.0,2864.0,2864.0,2864.0
mean,8208.84602,93.647346,0.026557,0.189246,0.217877,0.592877,0.352304,0.647696,0.782123,0.217877,-5.954269000000001e-17
std,11150.256153,67.038751,0.034471,0.391772,0.412876,0.491384,0.477772,0.477772,0.412876,0.412876,1.000175
min,1000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.6466313
25%,2128.25,38.0,0.005756,0.0,0.0,0.0,0.0,0.0,1.0,0.0,-0.5454276
50%,4078.0,81.0,0.013444,0.0,0.0,1.0,0.0,1.0,1.0,0.0,-0.3705356
75%,9454.0,136.0,0.031488,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.1116899
max,115340.0,243.0,0.239409,1.0,1.0,1.0,1.0,1.0,1.0,1.0,9.609632


In [67]:
labelColors = ['yellow', 'green', 'blue', 'red', 'purple']
univ_map = folium.Map(location=[39.829038, -98.579201], zoom_start=4)

# add markers to map
for lat, lng, name, neighborhood in zip(df['Latitude'], df['Longitude'], df['NAME'], df["Labels"]):
    label = name + " KGroup: " + str(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color=labelColors[neighborhood],
        fill=True,
        fill_color=labelColors[neighborhood],
        fill_opacity=0.7,
        parse_html=False).add_to(univ_map)  
    
univ_map

### I see multiple locations in cluster 3 that already have a Cheba Hut, so let's do a map with just that cluster

In [68]:
prime_map = folium.Map(location=[39.829038, -98.579201], zoom_start=4)
dfPrime = df[df["Labels"]==3]
# add markers to map
for lat, lng, name, neighborhood in zip(dfPrime['Latitude'], dfPrime['Longitude'], dfPrime['NAME'], dfPrime["Labels"]):
    label = name + " KGroup: " + str(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color=labelColors[neighborhood],
        fill=True,
        fill_color=labelColors[neighborhood],
        fill_opacity=0.7,
        parse_html=False).add_to(prime_map)  
    
prime_map

In [73]:
dfPrime.drop(columns=['UNIQUEID', 'ZIP', 'Latitude', 'Longitude', 'Labels']).describe()

Unnamed: 0,POPULATION,SandwichStoreCount,StoresPerPerson,legalWeedStatus_Fully Illegal,legalWeedStatus_Fully Legal,legalWeedStatus_Mixed,medicinalWeedStatus_False,medicinalWeedStatus_True,decriminalizedWeedStatus_False,decriminalizedWeedStatus_True,ScaledPopulation
count,174.0,174.0,174.0,174.0,174.0,174.0,174.0,174.0,174.0,174.0,174.0
mean,43333.149425,153.810345,0.003923,0.109195,0.218391,0.672414,0.321839,0.678161,0.781609,0.218391,3.150639
std,16315.047657,51.011658,0.001703,0.312784,0.414346,0.470688,0.46853,0.46853,0.414346,0.414346,1.463455
min,25466.0,55.0,0.000761,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.547962
25%,32494.25,120.25,0.002584,0.0,0.0,0.0,0.0,0.0,1.0,0.0,2.178393
50%,37938.5,142.5,0.003603,0.0,0.0,1.0,0.0,1.0,1.0,0.0,2.666741
75%,49765.5,195.5,0.004977,0.0,0.0,1.0,1.0,1.0,1.0,0.0,3.727619
max,115340.0,243.0,0.008922,1.0,1.0,1.0,1.0,1.0,1.0,1.0,9.609632
