# Capstone Project - Battle of the Neighbourhoods (Week 2)

## Introduction: Business Problem

In this project, we will try to find the ideal location for a pharmacy in the Greater Orlando Area in Florida, which is part of the United States of America.

Because Florida was one of the states hit hardest by the latest wave of COVID-19, it is likely that the pharmacies in Florida are understocked, serving too many, or are too from from too many people at the current moment. Orlando is likely facing this chaos as one of the largest cities in Florida. The addition of one pharmacy will benefit not only residents living close to the new pharmacy, but also help any potential pharmacists find a place to conduct their business. 

With the use of data science, I will be able to find the locations within the Greater Orlando Area that contain the least amount of pharmacies and thus, the places likeliest in need of an additional pharmacy.

## Data

Based on what's been outlined so far, I will need to use the Foursquare map data in order find neighbourhoods with the least number of pharmacies in the Greater Orlando Area.

Before that happens, however, I will need data on the Greater Orlando Area first to find out how it is divided. Once I find that out, I can split my search accordingly, find the 
areas most in need for each division and compare.

## Steps in the process

- step 1: extract data location about Greater Orlando from a website such as Wikipedia to find out how Greater Orlando is divided.
- step 2: use Nominatim from geocoder to find out approximate coordinates for each division of Greater Orlando.
- step 3: further divide Greater Orlando's counties into ZIP codes for increased accuracy of data.
- step 4: map data using Folium
- step 5: use dataframes from pandas to take data from Foursquare to find the frequency of pharmacies around these ZIP codes.
- step 6: extract population data for each ZIP code from a website to find out the population of the ZIP code compared to the number of pharmacies in the area.
- step 6: use the data from the dataframes to split areas into clusters based on frequency of pharmacies and population.
- step 7: use folium once again to see clusters
- step 8: compare data between divisions of Greater Orlando and determine conclusions.

## Find out how the Greater Orlando Area is divided

In [1]:
import random # library for random number generation
import numpy as np # library for vectorized computation
import pandas as pd # library to process data as dataframes

import matplotlib.pyplot as plt # plotting library
# backend for rendering plots within the browser
%matplotlib inline 

from sklearn.cluster import KMeans 
from sklearn.datasets.samples_generator import make_blobs
import json # library to handle JSON files

!pip install geopy # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # tranform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!pip -q install folium
import folium # map rendering library

!pip install shapely
import shapely.geometry

!pip install pyproj
import pyproj

!pip install lxml
!pip3 install lxml

import math
print('Libraries imported.')

Collecting geopy
[?25l  Downloading https://files.pythonhosted.org/packages/07/e1/9c72de674d5c2b8fcb0738a5ceeb5424941fefa080bfe4e240d0bacb5a38/geopy-2.0.0-py3-none-any.whl (111kB)
[K     |████████████████████████████████| 112kB 2.6MB/s eta 0:00:01
[?25hCollecting geographiclib<2,>=1.49 (from geopy)
  Downloading https://files.pythonhosted.org/packages/8b/62/26ec95a98ba64299163199e95ad1b0e34ad3f4e176e221c40245f211e425/geographiclib-1.50-py3-none-any.whl
Installing collected packages: geographiclib, geopy
Successfully installed geographiclib-1.50 geopy-2.0.0
Collecting shapely
[?25l  Downloading https://files.pythonhosted.org/packages/9d/18/557d4f55453fe00f59807b111cc7b39ce53594e13ada88e16738fb4ff7fb/Shapely-1.7.1-cp36-cp36m-manylinux1_x86_64.whl (1.0MB)
[K     |████████████████████████████████| 1.0MB 7.1MB/s eta 0:00:01
[?25hInstalling collected packages: shapely
Successfully installed shapely-1.7.1
Collecting lxml
[?25l  Downloading https://files.pythonhosted.org/packages/79/37/

From scraping the 'Greater Orlando' page on Wikipedia, I find that...

In [2]:
table = pd.read_html("https://en.wikipedia.org/wiki/Greater_Orlando")[2] #creating dataset
table

Unnamed: 0,County,2016 Estimate,2010 Census,Change,Area,Density
0,Lake County,335396,297052,+12.91%,"938.38 sq mi (2,430.4 km2)",2)
1,Orange County,1314367,1145956,+14.70%,"903.43 sq mi (2,339.9 km2)",2)
2,Osceola County,336015,268685,+25.06%,"1,327.45 sq mi (3,438.1 km2)",2)
3,Seminole County,455479,422718,+7.75%,309.22 sq mi (800.9 km2),2)
4,Total,2441257,2134411,+14.38%,"3,478.48 sq mi (9,009.2 km2)",2)


...it is split into four different counties.

In [3]:
orlando_counties = table.iloc[:,0] #extracting the names of the counties for future use
orlando_counties

0        Lake County
1      Orange County
2     Osceola County
3    Seminole County
4              Total
Name: County, dtype: object

In [4]:
orlando_counties = orlando_counties[0:4] #delete "total" from the list
orlando_counties

0        Lake County
1      Orange County
2     Osceola County
3    Seminole County
Name: County, dtype: object

## Coordinates of each County

I am now going to find the coordinates of each county in order to find the approximate area to use the Foursquare API on for later.

In [5]:
column_names = ['Area','Latitude','Longitude']
county_data = pd.DataFrame(columns = column_names) #create columns for use in the next cell
county_data

Unnamed: 0,Area,Latitude,Longitude


In [6]:

x=0
while x<len(orlando_counties): #use of while loop because for loop doesn't seem to work for me
    address = orlando_counties[x]
    geolocator = Nominatim(user_agent="ny_explorer") #find coordinates of each county
    location = geolocator.geocode('{}, Florida'.format(address)) #mention of florida is necessary to avoid coordinates found in California instead
    latitude = location.latitude
    longitude = location.longitude
    county_data = county_data.append({'Area': address, 'Latitude': latitude,'Longitude': longitude}, ignore_index=True) #appending all of these to the list created last cell

    x= x+1

county_data

Unnamed: 0,Area,Latitude,Longitude
0,Lake County,28.700686,-81.78994
1,Orange County,28.542111,-81.37903
2,Osceola County,28.044384,-81.143754
3,Seminole County,28.722583,-81.235368


Let's visualize the data so far.

In [7]:

geolocator = Nominatim(user_agent="ny_explorer") 
location = geolocator.geocode('Orlando') 
orlando_latitude = location.latitude
orlando_longitude = location.longitude #find approximate coordinates for Orlando itself 

map_orlando = folium.Map(location=[orlando_latitude, orlando_longitude], zoom_start=10) #creates a Folium map to show where each county is in respect to each other

# add markers to map
for lat, lng, area in zip(county_data['Latitude'], county_data['Longitude'], county_data['Area']):
    label = '{}'.format(area)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=15,
        popup=label,
        color='Blue',
        fill=True,
        fill_color='black',
        fill_opacity=0.7,
        parse_html=False).add_to(map_orlando)  
    
map_orlando

Within each county are townships and other subdivisions. Getting the coordinates of these subdivisions will probably lead to a more accurate result since the real counties are not circular in shape.

As such, I shall collect the zip codes of the counties in Greater Orlando.

In [8]:
lake_zip = pd.read_html("https://www.zipdatamaps.com/lake-fl-county-zipcodes")[1]
lake_zip2 = lake_zip.pop('List of Zipcodes in Lake County, Florida')
lake_zip2.pop('Type')
lake_zip2.pop('Population')
lake_zip2 = lake_zip2.dropna()
print(lake_zip2.shape)
lake_zip2 = lake_zip2.astype({'ZIP Code': 'int'})
lake_zip2 = lake_zip2.assign(County = 'Lake County')
lake_zip2.head(6)

(36, 2)


Unnamed: 0,ZIP Code,ZIP Code Name,County
0,32102,Astor,Lake County
1,32159,Lady Lake,Lake County
2,32195,Weirsdale,Lake County
3,32702,Altoona,Lake County
4,32720,DeLand,Lake County
6,32726,Eustis,Lake County


In [9]:
orange_zip = pd.read_html("https://www.zipdatamaps.com/orange-fl-county-zipcodes")[1]
orange_zip2 = orange_zip.pop('List of Zipcodes in Orange County, Florida')
orange_zip2.pop('Type')
orange_zip2.pop('Population')
orange_zip2 = orange_zip2.dropna()
print(orange_zip2.shape)
orange_zip2 = orange_zip2.assign(County = 'Orange County')
orange_zip2.head(6)

(85, 2)


Unnamed: 0,ZIP Code,ZIP Code Name,County
0,32703.0,Apopka,Orange County
1,32709.0,Christmas,Orange County
2,32712.0,Apopka,Orange County
3,32751.0,Maitland,Orange County
4,32757.0,Mount Dora,Orange County
6,32776.0,Sorrento,Orange County


In [10]:
osceola_zip = pd.read_html("https://www.zipdatamaps.com/osceola-fl-county-zipcodes")[1]
osceola_zip2 = osceola_zip.pop('List of Zipcodes in Osceola County, Florida')
osceola_zip2.pop('Type')
osceola_zip2.pop('Population')
osceola_zip2 = osceola_zip2.dropna()
print(osceola_zip2.shape)
osceola_zip2 = osceola_zip2.assign(County = 'Osceola County')
osceola_zip2.head(6)

(19, 2)


Unnamed: 0,ZIP Code,ZIP Code Name,County
0,33844.0,Haines City,Osceola County
1,33896.0,Davenport,Osceola County
2,33898.0,Lake Wales,Osceola County
3,34739.0,Kenansville,Osceola County
4,34741.0,Kissimmee,Osceola County
6,34743.0,Kissimmee,Osceola County


In [11]:
seminole_zip = pd.read_html("https://www.zipdatamaps.com/seminole-fl-county-zipcodes")[1]
seminole_zip2 = seminole_zip.pop('List of Zipcodes in Seminole County, Florida')
seminole_zip2.pop('Type')
seminole_zip2.pop('Population')
seminole_zip2 = seminole_zip2.dropna()
print(seminole_zip2.shape)
seminole_zip2 = seminole_zip2.assign(County = 'Seminole County')
seminole_zip2.head(6)

(28, 2)


Unnamed: 0,ZIP Code,ZIP Code Name,County
0,32701.0,Altamonte Springs,Seminole County
1,32703.0,Apopka,Seminole County
2,32707.0,Casselberry,Seminole County
3,32708.0,Winter Springs,Seminole County
4,32714.0,Altamonte Springs,Seminole County
6,32730.0,Casselberry,Seminole County


In [12]:
zipdata = [orange_zip2, osceola_zip2, seminole_zip2]
zipdata2 = lake_zip2.append(zipdata)
zipdata2 = zipdata2.sort_values(by='ZIP Code').reset_index(drop=True)
zipdata2 = zipdata2.astype({'ZIP Code': 'int'})
zipdata2

Unnamed: 0,ZIP Code,ZIP Code Name,County
0,32102,Astor,Lake County
1,32158,Lady Lake,Lake County
2,32159,Lady Lake,Lake County
3,32195,Weirsdale,Lake County
4,32701,Altamonte Springs,Seminole County
...,...,...,...
163,34787,Winter Garden,Orange County
164,34787,Winter Garden,Lake County
165,34788,Leesburg,Lake County
166,34789,Leesburg,Lake County


With all of the ZIP Codes in the counties collected, it's time to collect their approximate coordinates. I will be using Nominatim's API since it allows for unlimited requests for free.

In [13]:
column_names2 = ['ZIP Code','ZIP Code Name','Latitude','Longitude']
zip_data = pd.DataFrame(columns = column_names2) #create columns for use in the next cell
zip_data

Unnamed: 0,ZIP Code,ZIP Code Name,Latitude,Longitude


In [14]:
def get_coordinates(ZIP_code):
    try:
        url = 'https://nominatim.openstreetmap.org/search?state=florida&postalcode={}&format=json'.format(ZIP_code)
        response = requests.get(url).json()
        # get geographical coordinates
        lat = response[0]['lat']
        lon = response[0]['lon']
        return [lat, lon]
    except:
        return [None,None]

In [15]:
x=0
while x<len(zipdata2): #use of while loop because for loop doesn't seem to work for me
    zip_code = zipdata2.iloc[x,0]
    zip_area = zipdata2.iloc[x,1]
    latlon = get_coordinates(zip_code)
    lat = latlon[0]
    lon = latlon[1]
    zip_data = zip_data.append({'ZIP Code': zip_code, 'ZIP Code Name': zip_area, 'Latitude': lat, 'Longitude': lon}, ignore_index=True) #appending all of these to the list created last cell
    print(zip_code, zip_area, lat,lon)
    x= x+1
print("Coordinates gathered.")


32102 Astor 29.163126506600804 -81.54282155818989
32158 Lady Lake 28.92539122641275 -81.91390305436343
32159 Lady Lake 28.92686967123771 -81.92151788533704
32195 Weirsdale 28.981785966445898 -81.90928269373009
32701 Altamonte Springs 28.6629017663051 -81.37159257460407
32702 Altoona 29.01363534361128 -81.634212836383
32703 Apopka 28.663528061363316 -81.47442658691442
32703 Apopka 28.663528061363316 -81.47442658691442
32704 Apopka 28.67049762984486 -81.52789408787604
32707 Casselberry 28.66474674431558 -81.31995878419083
32708 Winter Springs 28.68419934136601 -81.27723664748983
32709 Christmas 28.564217550000002 -81.06555334999999
32710 Clarcona 28.6136739 -81.4826138
32712 Apopka 28.7264304356021 -81.52190042511458
32714 Altamonte Springs 28.663925705082022 -81.4136105572887
32715 Altamonte Springs None None
32716 Altamonte Springs None None
32718 Casselberry None None
32719 Winter Springs None None
32720 DeLand 29.02662943260501 -81.34023263363747
32726 Eustis 28.850021157207813 -81.6

Although the coordinates for most of the ZIP codes were collected, it looks like some returned as 'None, None'. This is due to Nominatim's less user-friendly coordinate lookup for American ZIP codes compared to other APIs like Google's that require API keys, credentials, and payment per request. I've tried to optimize Nominatim's searching capabilities for American ZIP codes as much as I could, and this is the result. (Note: ZIP codes that are 5 numbers long are used in other countries around the world, and as such, simply searching for the zip code without specifying 'Florida' could result in coordinates found in other countries. However, even this is not enough to get all the results because Nominatim's database does not attribute a state to every zip code for some reason; for example, some zip codes show up without the 'Florida' specification. 

In [16]:
print(zip_data.shape)
zip_data = zip_data.dropna()
print(zip_data.shape)
zip_data = zip_data.reset_index(drop=True)
zip_data

(168, 4)
(115, 4)


Unnamed: 0,ZIP Code,ZIP Code Name,Latitude,Longitude
0,32102,Astor,29.163126506600804,-81.54282155818989
1,32158,Lady Lake,28.92539122641275,-81.91390305436343
2,32159,Lady Lake,28.92686967123771,-81.92151788533704
3,32195,Weirsdale,28.981785966445898,-81.90928269373009
4,32701,Altamonte Springs,28.6629017663051,-81.37159257460407
...,...,...,...,...
110,34786,Windermere,28.453922904007165,-81.56656674423485
111,34787,Winter Garden,28.4962862861846,-81.6111078205875
112,34787,Winter Garden,28.4962862861846,-81.6111078205875
113,34788,Leesburg,28.85691542739478,-81.78048769958376


In [17]:
zip_data2 = zip_data.merge(zipdata2, on =['ZIP Code','ZIP Code Name'],how='inner')
zip_data2

Unnamed: 0,ZIP Code,ZIP Code Name,Latitude,Longitude,County
0,32102,Astor,29.163126506600804,-81.54282155818989,Lake County
1,32158,Lady Lake,28.92539122641275,-81.91390305436343,Lake County
2,32159,Lady Lake,28.92686967123771,-81.92151788533704,Lake County
3,32195,Weirsdale,28.981785966445898,-81.90928269373009,Lake County
4,32701,Altamonte Springs,28.6629017663051,-81.37159257460407,Seminole County
...,...,...,...,...,...
126,34787,Winter Garden,28.4962862861846,-81.6111078205875,Lake County
127,34787,Winter Garden,28.4962862861846,-81.6111078205875,Orange County
128,34787,Winter Garden,28.4962862861846,-81.6111078205875,Lake County
129,34788,Leesburg,28.85691542739478,-81.78048769958376,Lake County


It looks like, due to Nominatim's lower acccuracy, a little more than a quarter of the zip codes had to be ommitted. Luckily, around half of them were Orlando, so it shouldn't have too much of an effect on finding pharmacies. 

In [18]:
zip_data2 = zip_data2.astype({'Latitude': 'float64'})
zip_data2 = zip_data2.astype({'Longitude':'float64'})

In [19]:
map_orlando2 = folium.Map(location=[orlando_latitude, orlando_longitude], zoom_start=9.5) #creates a Folium map to show where each county is in respect to each other

# add markers to map

for lat, lng, code, area in zip(zip_data2['Latitude'], zip_data2['Longitude'], zip_data2['ZIP Code'], zip_data2['ZIP Code Name']):
    label = '{}, {}, {}, {}'.format(code, area, lat, lng)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker([lat, lng], radius=5, popup=label, color='Blue', fill=True, fill_color='black', fill_opacity=0.7, parse_html=False).add_to(map_orlando2)  

map_orlando2

With that out of the way, I wanted to add a population per ZIP code to the table. This would help to create more unique clustering than simply having the pharmacy count.

In [20]:
zip_population = pd.read_html('https://worldpopulationreview.com/zips/florida')[0]
zip_population.columns = ['ZIP Code', 'ZIP Code Name', 'County', 'Population']
zip_population

Unnamed: 0,ZIP Code,ZIP Code Name,County,Population
0,33012,Hialeah,Miami-Dade County,75666.0
1,33024,Hollywood,Broward County,75306.0
2,33023,Hollywood,Broward County,73671.0
3,33311,Fort Lauderdale,Broward County,73034.0
4,33025,Hollywood,Broward County,71763.0
...,...,...,...,...
956,34101,Naples,Collier County,
957,32530,Bagdad,Santa Rosa County,
958,33945,Pineland,Lee County,
959,32432,Cypress,Jackson County,


These are the populations for each ZIP code in all of Florida, but I can merge the tables in a way that will filter out the ZIP codes that are unused. Keeping the 'County' separate from table to table is necessary because ZIP codes can be in multiple counties at the same time (as seen in Weirsdale, 4th row). 

In [21]:
zip_data_pop = zip_data2.merge(zip_population, on=['ZIP Code', 'ZIP Code Name'],how='outer')
zip_data_pop['Population'] = zip_data_pop['Population'].replace(np.nan,0)
zip_data_pop = zip_data_pop.dropna(thresh = 5).reset_index(drop=True)
zip_data_pop

Unnamed: 0,ZIP Code,ZIP Code Name,Latitude,Longitude,County_x,County_y,Population
0,32102.0,Astor,29.163127,-81.542822,Lake County,Lake County,2195.0
1,32158.0,Lady Lake,28.925391,-81.913903,Lake County,,0.0
2,32159.0,Lady Lake,28.926870,-81.921518,Lake County,Lake County,30135.0
3,32195.0,Weirsdale,28.981786,-81.909283,Lake County,Marion County,2986.0
4,32701.0,Altamonte Springs,28.662902,-81.371593,Seminole County,Seminole County,21997.0
...,...,...,...,...,...,...,...
126,34787.0,Winter Garden,28.496286,-81.611108,Lake County,Orange County,64723.0
127,34787.0,Winter Garden,28.496286,-81.611108,Orange County,Orange County,64723.0
128,34787.0,Winter Garden,28.496286,-81.611108,Lake County,Orange County,64723.0
129,34788.0,Leesburg,28.856915,-81.780488,Lake County,Lake County,16886.0


## Foursquare API

Now, it's time to use the Foursquare API to find pharmacies that can be found nearby. Foursquare will be finding all pharmacies in a 10 kilometer radius around each ZIP code's coordinates. Inevitably, duplicates will show up, but they can be removed later. This large radius is necessary since some of the ZIP codes located further away from Orlando are quite rural. This radius size also prevents the code from crashing randomly at times.

In [22]:
CLIENT_ID = 'XFI3JL33WMJKFPGACLK2TT5KLLT4E4520WVL211BVWUOMZZ5' # your Foursquare ID
CLIENT_SECRET = '0VRUP1TO1IKLN4SZJ2QXV2EHEEARF4SZUMH5SNBIRTPMSEQG' # your Foursquare Secret



In [23]:
category = '4bf58dd8d48988d10f951735'
def getAllPharmacies(zips, zip_areas, latitudes, longitudes, radius=10000, LIMIT= 2000):
    venues_list = []
    VERSION = '20180605' # Foursquare API version
    # create the API request URL
    for zips, zip_areas, lat, lng in zip(zips, zip_areas, latitudes, longitudes):
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&categoryId={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, VERSION, lat, lng, category, radius, LIMIT)
        print(zips, zip_areas, lat, lng)
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        try:
            # return only relevant information for each nearby venue
            venues_list.append([(zips, zip_areas, lat, lng, v['venue']['name'], v['venue']['location']['lat'], v['venue']['location']['lng']) for v in results])

            nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
            nearby_venues.columns = ['ZIP Code', 'ZIP Code Name', 'ZIP Latitude', 'ZIP Longitude','Pharmacy Name', 'Pharmacy Latitude', 'Pharmacy Longitude']
        except:
            pass
    print("Locations gathered.")
    return nearby_venues

In [24]:
pharmacy_list = getAllPharmacies(zip_data_pop['ZIP Code'], zip_data_pop['ZIP Code Name'], zip_data_pop['Latitude'], zip_data_pop['Longitude'])

32102.0 Astor 29.163126506600804 -81.54282155818989
32158.0 Lady Lake 28.92539122641275 -81.91390305436343
32159.0 Lady Lake 28.92686967123771 -81.92151788533704
32195.0 Weirsdale 28.981785966445898 -81.90928269373009
32701.0 Altamonte Springs 28.6629017663051 -81.37159257460407
32702.0 Altoona 29.01363534361128 -81.634212836383
32703.0 Apopka 28.663528061363316 -81.47442658691442
32703.0 Apopka 28.663528061363316 -81.47442658691442
32703.0 Apopka 28.663528061363316 -81.47442658691442
32703.0 Apopka 28.663528061363316 -81.47442658691442
32704.0 Apopka 28.67049762984486 -81.52789408787604
32707.0 Casselberry 28.66474674431558 -81.31995878419083
32708.0 Winter Springs 28.68419934136601 -81.27723664748983
32709.0 Christmas 28.564217550000002 -81.06555334999999
32710.0 Clarcona 28.6136739 -81.4826138
32712.0 Apopka 28.7264304356021 -81.52190042511458
32714.0 Altamonte Springs 28.663925705082022 -81.4136105572887
32720.0 DeLand 29.02662943260501 -81.34023263363747
32726.0 Eustis 28.85002115

In [25]:
pharmacy_list.shape

(5181, 7)

Quite a large amount of pharmacies in the area, huh. But there is expected to be a significant amount of overlap. However, due to how this has been set up, we can take a glimpse as to how many pharmacies are within a 10 kilometer radius of each ZIP code, which will be done later. Now let's look at how many pharmacies there really are around Greater Orlando.

In [26]:
pharmacy_list2 = pharmacy_list.drop(['ZIP Code', 'ZIP Code Name', 'ZIP Latitude', 'ZIP Longitude'], axis=1).drop_duplicates()
pharmacy_list2 = pharmacy_list2.reset_index(drop=True)
pharmacy_list2

Unnamed: 0,Pharmacy Name,Pharmacy Latitude,Pharmacy Longitude
0,Walgreens,28.934003,-81.937640
1,Vitamin Shoppe,28.932941,-81.934666
2,Publix,28.939711,-81.947891
3,CVS pharmacy,28.936458,-81.941392
4,CVS pharmacy,28.936772,-81.941651
...,...,...,...
575,Walgreens,28.246568,-81.243148
576,Walgreens,28.197232,-81.293501
577,Publix,28.247681,-81.242724
578,Publix,28.196212,-81.292563


Alright, now with the coordinates, we can plot the pharmacies on the map in relation to the ZIP codes.

In [27]:
# add markers to map
for lat, lng, name in zip(pharmacy_list2['Pharmacy Latitude'], pharmacy_list2['Pharmacy Longitude'], pharmacy_list2['Pharmacy Name']):
    label = '{}, {}, {}'.format(name, lat, lng)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker([lat, lng], radius=2.5, popup=label, color='Green', fill=True, fill_color='Yellow', fill_opacity=0.7, parse_html=False).add_to(map_orlando2)      
map_orlando2

With all the data now gathered, it's time to move on to finding some ideal locations for a new pharmacy.

## Methodology

In order to find the ideal location for the new pharmacy, I have to set down some requirements. Ideally, the location would have the highest population and least amount of pharmacies in the nearby area. Having 0 pharmacies is especially preferable since it is a service that is not already present in the area.

In order to reach these requirements, I need to count the number of pharmacies nearby for each ZIP code. After doing so, I can create clusters to find the find the zip codes with the most favourable combination of population and number of pharmacies.

Firstly, I need to remove the duplicate results.

In [28]:
pharmacy_list.shape
pharmacy_list= pharmacy_list.drop_duplicates()
pharmacy_list.shape

(4218, 7)

Now the number of pharmacies within the 10 kilometer radius of the ZIP codes will be counted.

In [29]:
area_pharmacy_count = pharmacy_list.groupby(pharmacy_list['ZIP Code']).count()
area_pharmacy_count = area_pharmacy_count[['ZIP Code Name']]
area_pharmacy_count.columns = ['Pharmacy Count']
area_pharmacy_count = area_pharmacy_count.reset_index()
area_pharmacy_count.sort_values(by='Pharmacy Count')

Unnamed: 0,ZIP Code,Pharmacy Count
4,32702.0,1
17,32736.0,1
76,34737.0,2
29,32776.0,2
85,34753.0,3
...,...,...
37,32801.0,100
46,32811.0,100
35,32792.0,100
21,32751.0,100


Looks like 15 ZIP codes do not have any pharmacies in the radius. Now it's time to merge this data with the data from before for a more complete table.

In [30]:
zip_pharm_pop = zip_data_pop.merge(area_pharmacy_count, on ='ZIP Code',how='outer').drop_duplicates().reset_index(drop=True)
zip_pharm_pop 

Unnamed: 0,ZIP Code,ZIP Code Name,Latitude,Longitude,County_x,County_y,Population,Pharmacy Count
0,32102.0,Astor,29.163127,-81.542822,Lake County,Lake County,2195.0,
1,32158.0,Lady Lake,28.925391,-81.913903,Lake County,,0.0,23.0
2,32159.0,Lady Lake,28.926870,-81.921518,Lake County,Lake County,30135.0,22.0
3,32195.0,Weirsdale,28.981786,-81.909283,Lake County,Marion County,2986.0,17.0
4,32701.0,Altamonte Springs,28.662902,-81.371593,Seminole County,Seminole County,21997.0,83.0
...,...,...,...,...,...,...,...,...
110,34786.0,Windermere,28.453923,-81.566567,Orange County,Orange County,43458.0,37.0
111,34787.0,Winter Garden,28.496286,-81.611108,Orange County,Orange County,64723.0,22.0
112,34787.0,Winter Garden,28.496286,-81.611108,Lake County,Orange County,64723.0,22.0
113,34788.0,Leesburg,28.856915,-81.780488,Lake County,Lake County,16886.0,12.0


I can't do much with 'NaN', so I'll turn all of them into 0 due to lack of data.

In [31]:
zip_pharm_pop['Pharmacy Count'] = zip_pharm_pop['Pharmacy Count'].fillna(0)
zip_pharm_pop['Population'] = zip_pharm_pop['Population'].fillna(0)
zip_pharm_pop = zip_pharm_pop.astype({'ZIP Code':int, 'Pharmacy Count':int, 'Population':int})
zip_pharm_pop

Unnamed: 0,ZIP Code,ZIP Code Name,Latitude,Longitude,County_x,County_y,Population,Pharmacy Count
0,32102,Astor,29.163127,-81.542822,Lake County,Lake County,2195,0
1,32158,Lady Lake,28.925391,-81.913903,Lake County,,0,23
2,32159,Lady Lake,28.926870,-81.921518,Lake County,Lake County,30135,22
3,32195,Weirsdale,28.981786,-81.909283,Lake County,Marion County,2986,17
4,32701,Altamonte Springs,28.662902,-81.371593,Seminole County,Seminole County,21997,83
...,...,...,...,...,...,...,...,...
110,34786,Windermere,28.453923,-81.566567,Orange County,Orange County,43458,37
111,34787,Winter Garden,28.496286,-81.611108,Orange County,Orange County,64723,22
112,34787,Winter Garden,28.496286,-81.611108,Lake County,Orange County,64723,22
113,34788,Leesburg,28.856915,-81.780488,Lake County,Lake County,16886,12


After noticing the ZIP Code '34787', it becomes apparent to me that when merging the data, duplicates of the data may have shown up, so I decide to filter them.

In [32]:
zip_pharm_pop.groupby('ZIP Code').count().sort_values(by='Latitude').tail(10)

Unnamed: 0_level_0,ZIP Code Name,Latitude,Longitude,County_x,County_y,Population,Pharmacy Count
ZIP Code,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
32784,1,1,1,1,1,1,1
32808,1,1,1,1,1,1,1
34787,2,2,2,2,2,2,2
32757,2,2,2,2,2,2,2
32703,2,2,2,2,2,2,2
32792,2,2,2,2,2,2,2
34747,2,2,2,2,2,2,2
32776,2,2,2,2,2,2,2
34771,2,2,2,2,2,2,2
32751,2,2,2,2,2,2,2


With this, I can see which index where the duplicates are located and take them out of the list.

In [33]:
multiple_counties = [34787, 32757, 32703, 32792, 34747, 32776, 34771, 32751]
for x in multiple_counties:
    print(zip_pharm_pop.loc[zip_pharm_pop['ZIP Code'] == x])

     ZIP Code  ZIP Code Name   Latitude  Longitude       County_x  \
111     34787  Winter Garden  28.496286 -81.611108  Orange County   
112     34787  Winter Garden  28.496286 -81.611108    Lake County   

          County_y  Population  Pharmacy Count  
111  Orange County       64723              22  
112  Orange County       64723              22  
    ZIP Code ZIP Code Name   Latitude  Longitude       County_x     County_y  \
27     32757    Mount Dora  28.803995 -81.648786  Orange County  Lake County   
28     32757    Mount Dora  28.803995 -81.648786    Lake County  Lake County   

    Population  Pharmacy Count  
27       26679              19  
28       26679              19  
   ZIP Code ZIP Code Name   Latitude  Longitude         County_x  \
6     32703        Apopka  28.663528 -81.474427    Orange County   
7     32703        Apopka  28.663528 -81.474427  Seminole County   

        County_y  Population  Pharmacy Count  
6  Orange County       50992              43  
7  Ora

In [34]:
zip_pharm_pop = zip_pharm_pop.drop([111,27,6,42,93,35,105,24])
zip_pharm_pop

Unnamed: 0,ZIP Code,ZIP Code Name,Latitude,Longitude,County_x,County_y,Population,Pharmacy Count
0,32102,Astor,29.163127,-81.542822,Lake County,Lake County,2195,0
1,32158,Lady Lake,28.925391,-81.913903,Lake County,,0,23
2,32159,Lady Lake,28.926870,-81.921518,Lake County,Lake County,30135,22
3,32195,Weirsdale,28.981786,-81.909283,Lake County,Marion County,2986,17
4,32701,Altamonte Springs,28.662902,-81.371593,Seminole County,Seminole County,21997,83
...,...,...,...,...,...,...,...,...
109,34778,Winter Garden,28.687367,-81.637481,Orange County,,0,0
110,34786,Windermere,28.453923,-81.566567,Orange County,Orange County,43458,37
112,34787,Winter Garden,28.496286,-81.611108,Lake County,Orange County,64723,22
113,34788,Leesburg,28.856915,-81.780488,Lake County,Lake County,16886,12


There's no need to worry about doubling of values since area_pharmacy_count was counting based on the ZIP code and no other factors.

In [35]:
zip_pharm_pop = zip_pharm_pop.drop("County_y",axis=1).rename(columns={"County_x":"County"}).reset_index(drop=True)
zip_pharm_pop

Unnamed: 0,ZIP Code,ZIP Code Name,Latitude,Longitude,County,Population,Pharmacy Count
0,32102,Astor,29.163127,-81.542822,Lake County,2195,0
1,32158,Lady Lake,28.925391,-81.913903,Lake County,0,23
2,32159,Lady Lake,28.926870,-81.921518,Lake County,30135,22
3,32195,Weirsdale,28.981786,-81.909283,Lake County,2986,17
4,32701,Altamonte Springs,28.662902,-81.371593,Seminole County,21997,83
...,...,...,...,...,...,...,...
102,34778,Winter Garden,28.687367,-81.637481,Orange County,0,0
103,34786,Windermere,28.453923,-81.566567,Orange County,43458,37
104,34787,Winter Garden,28.496286,-81.611108,Lake County,64723,22
105,34788,Leesburg,28.856915,-81.780488,Lake County,16886,12


Now, I will make an arbitrary scoring system for the clusters. The cluster I will be looking for will be a cluster that contains the ZIP codes most in need of a pharmacy. This should be categorized by amount of population in the ZIP code, but mostly by the pharmacy count. Thus, I will be looking for the ZIP codes that have the highest population with 0 pharmacies. Since the clusters are created based on how similar the numbers in the columns of a row are to other rows, it seems like a good idea to create these formulas to categorize each ZIP code.

In [36]:
zip_pharm_pop2 = zip_pharm_pop[["Population","Pharmacy Count"]]

zip_pharm_pop2.loc[zip_pharm_pop2["Population"] < 100,"Population"] = -50
zip_pharm_pop2.loc[zip_pharm_pop2["Population"] > 50000,"Population"] = 100
zip_pharm_pop2.loc[zip_pharm_pop2["Population"] > 25000,"Population"] = 80 + round(zip_pharm_pop["Population"]/2500)
zip_pharm_pop2.loc[zip_pharm_pop2["Population"] > 10000,"Population"] = 60 + round(zip_pharm_pop["Population"]/750)
zip_pharm_pop2.loc[zip_pharm_pop2["Population"] > 5000,"Population"] = 40 + round(zip_pharm_pop["Population"]/500)
zip_pharm_pop2.loc[zip_pharm_pop2["Population"] > 1000,"Population"] = 20 + round(zip_pharm_pop["Population"]/250)
zip_pharm_pop2.loc[zip_pharm_pop2["Population"] > 100,"Population"] = 10 + round(zip_pharm_pop["Population"]/100)

zip_pharm_pop2.loc[zip_pharm_pop2["Pharmacy Count"] >= 100,"Pharmacy Count"] = -50
zip_pharm_pop2.loc[zip_pharm_pop2["Pharmacy Count"] > 50,"Pharmacy Count"] = 50 - zip_pharm_pop["Pharmacy Count"]
zip_pharm_pop2.loc[zip_pharm_pop2["Pharmacy Count"] >= 1,"Pharmacy Count"] = 75 - zip_pharm_pop["Pharmacy Count"]
zip_pharm_pop2.loc[zip_pharm_pop2["Pharmacy Count"] == 0,"Pharmacy Count"] = 100
#zip_pharm_pop2 = zip_pharm_pop2.replace(to_replace=['Seminole County', 'Orange County', 'Lake County', 'Osceola County'] ,value=[100,90,30,20]).astype(int)

zip_pharm_pop2 

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_with_indexer(indexer, value)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  This is separate from the ipykernel package so we can avoid doing imports until
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  after removing the cwd from sys.path.
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_gui

Unnamed: 0,Population,Pharmacy Count
0,29.0,100
1,-50.0,52
2,92.0,53
3,32.0,58
4,89.0,-33
...,...,...
102,-50.0,100
103,97.0,38
104,100.0,53
105,83.0,63


with that out of the way, it's time to create the random clusters and then plot them.

In [37]:
kclusters = 8
# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(zip_pharm_pop2)
# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([7, 2, 3, 1, 4, 1, 6, 2, 4, 0], dtype=int32)

In [38]:
zip_pharm_pop.insert(0, 'Cluster Labels', kmeans.labels_)
zip_pharm_pop

Unnamed: 0,Cluster Labels,ZIP Code,ZIP Code Name,Latitude,Longitude,County,Population,Pharmacy Count
0,7,32102,Astor,29.163127,-81.542822,Lake County,2195,0
1,2,32158,Lady Lake,28.925391,-81.913903,Lake County,0,23
2,3,32159,Lady Lake,28.926870,-81.921518,Lake County,30135,22
3,1,32195,Weirsdale,28.981786,-81.909283,Lake County,2986,17
4,4,32701,Altamonte Springs,28.662902,-81.371593,Seminole County,21997,83
...,...,...,...,...,...,...,...,...
102,2,34778,Winter Garden,28.687367,-81.637481,Orange County,0,0
103,6,34786,Windermere,28.453923,-81.566567,Orange County,43458,37
104,3,34787,Winter Garden,28.496286,-81.611108,Lake County,64723,22
105,3,34788,Leesburg,28.856915,-81.780488,Lake County,16886,12


In [39]:
cluster_map = folium.Map(location=[orlando_latitude, orlando_longitude], zoom_start=9.5)
# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, zip_code, zip_code_name, cluster, count, pop in zip(zip_pharm_pop['Latitude'], zip_pharm_pop['Longitude'], zip_pharm_pop['ZIP Code'], zip_pharm_pop['ZIP Code Name'], zip_pharm_pop['Cluster Labels'], zip_pharm_pop['Pharmacy Count'], zip_pharm_pop['Population']):
    label = folium.Popup(str(zip_code) + ', ' + str(zip_code_name) + ',' + ' Cluster ' + str(cluster) + ', Pharmacies: ' + str(int(count)) + ', Population: ' + str(int(pop)), parse_html=True)
    folium.CircleMarker([lat, lon], radius=7, popup=label, color='black', weight=1, fill=True, fill_color=rainbow[cluster], fill_opacity=0.9).add_to(cluster_map)


In [40]:
for lat, lng, name in zip(pharmacy_list2['Pharmacy Latitude'], pharmacy_list2['Pharmacy Longitude'], pharmacy_list2['Pharmacy Name']):
    label = '{}, {}, {}'.format(name, lat, lng)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker([lat, lng], radius=2, popup=label, color='Green', fill=True, fill_color='Yellow', fill_opacity=0.7, parse_html=False).add_to(cluster_map)      
cluster_map

Time to look through the data to find the most favourable cluster to add a pharmacy to.

In [41]:
print("Colour: Purple")
zip_pharm_pop.loc[zip_pharm_pop['Cluster Labels'] == 0]

Colour: Purple


Unnamed: 0,Cluster Labels,ZIP Code,ZIP Code Name,Latitude,Longitude,County,Population,Pharmacy Count
9,0,32708,Winter Springs,28.684199,-81.277237,Seminole County,44826,61
13,0,32714,Altamonte Springs,28.663926,-81.413611,Seminole County,37223,73
22,0,32750,Longwood,28.705292,-81.34611,Seminole County,24389,71
35,0,32779,Longwood,28.717816,-81.402298,Seminole County,29737,60
52,0,32817,Orlando,28.583798,-81.241879,Orange County,35317,75
53,0,32818,Orlando,28.564635,-81.481212,Orange County,55586,68
56,0,32821,Orlando,28.381613,-81.494493,Orange County,17374,70
57,0,32822,Orlando,28.491771,-81.292478,Orange County,60781,60
58,0,32824,Orlando,28.378939,-81.364017,Orange County,46545,56
59,0,32825,Orlando,28.55051,-81.251273,Orange County,61550,74


In [42]:
print("Colour: Dark Blue")
zip_pharm_pop.loc[zip_pharm_pop['Cluster Labels'] == 1]

Colour: Dark Blue


Unnamed: 0,Cluster Labels,ZIP Code,ZIP Code Name,Latitude,Longitude,County,Population,Pharmacy Count
3,1,32195,Weirsdale,28.981786,-81.909283,Lake County,2986,17
5,1,32702,Altoona,29.013635,-81.634213,Lake County,2793,1
10,1,32709,Christmas,28.564218,-81.065553,Orange County,1986,6
18,1,32735,Grand Island,28.888721,-81.746485,Lake County,4673,12
19,1,32736,Eustis,28.90372,-81.515285,Lake County,9922,1
39,1,32798,Zellwood,28.717705,-81.576182,Orange County,2595,7
55,1,32820,Orlando,28.562042,-81.12216,Orange County,9240,21
61,1,32827,Orlando,28.418887,-81.312653,Orange County,9421,40
70,1,33848,Intercession City,28.262749,-81.511129,Osceola County,1059,22
71,1,33896,Davenport,28.254276,-81.609538,Osceola County,9746,21


In [43]:
print("Colour: Blue")
zip_pharm_pop.loc[zip_pharm_pop['Cluster Labels'] == 2]

Colour: Blue


Unnamed: 0,Cluster Labels,ZIP Code,ZIP Code Name,Latitude,Longitude,County,Population,Pharmacy Count
1,2,32158,Lady Lake,28.925391,-81.913903,Lake County,0,23
7,2,32704,Apopka,28.670498,-81.527894,Orange County,0,20
14,2,32720,DeLand,29.026629,-81.340233,Lake County,0,15
21,2,32747,Lake Monroe,28.815531,-81.324813,Seminole County,0,35
24,2,32756,Mount Dora,28.74446,-81.680185,Lake County,0,15
29,2,32768,Plymouth,28.696722,-81.557011,Orange County,0,8
33,2,32777,Tangerine,28.762946,-81.630696,Orange County,0,12
63,2,32830,Orlando,28.395126,-81.56724,Orange County,5,32
75,2,34714,Clermont,28.390513,-81.722924,Lake County,0,6
76,2,34715,Clermont,28.637509,-81.767598,Lake County,0,5


In [44]:
print("Colour: Darker Turquoise")
zip_pharm_pop.loc[zip_pharm_pop['Cluster Labels'] == 3]

Colour: Darker Turquoise


Unnamed: 0,Cluster Labels,ZIP Code,ZIP Code Name,Latitude,Longitude,County,Population,Pharmacy Count
2,3,32159,Lady Lake,28.92687,-81.921518,Lake County,30135,22
12,3,32712,Apopka,28.72643,-81.5219,Orange County,46662,14
15,3,32726,Eustis,28.850021,-81.678317,Lake County,22377,21
25,3,32757,Mount Dora,28.803995,-81.648786,Lake County,26679,19
27,3,32766,Oviedo,28.638001,-81.121353,Seminole County,17252,15
32,3,32776,Sorrento,28.806416,-81.538306,Lake County,12416,2
34,3,32778,Tavares,28.797399,-81.731253,Lake County,21406,21
36,3,32784,Umatilla,28.943325,-81.693526,Lake County,12313,4
64,3,32832,Orlando,28.394327,-81.213231,Orange County,24325,6
69,3,33844,Haines City,28.093642,-81.603484,Osceola County,35280,7


In [45]:
print("Colour: Lighter Turquoise")
zip_pharm_pop.loc[zip_pharm_pop['Cluster Labels'] == 4]

Colour: Lighter Turquoise


Unnamed: 0,Cluster Labels,ZIP Code,ZIP Code Name,Latitude,Longitude,County,Population,Pharmacy Count
4,4,32701,Altamonte Springs,28.662902,-81.371593,Seminole County,21997,83
8,4,32707,Casselberry,28.664747,-81.319959,Seminole County,35369,84
16,4,32730,Casselberry,28.651445,-81.341814,Seminole County,5991,93
23,4,32751,Maitland,28.628109,-81.355038,Orange County,22073,100
37,4,32789,Winter Park,28.598721,-81.356165,Orange County,26568,100
38,4,32792,Winter Park,28.599701,-81.301261,Orange County,50991,100
40,4,32801,Orlando,28.543426,-81.37802,Orange County,12174,100
41,4,32803,Orlando,28.559768,-81.364089,Orange County,19583,100
42,4,32804,Orlando,28.576847,-81.384474,Orange County,17653,100
43,4,32805,Orlando,28.535304,-81.403578,Orange County,20695,100


In [46]:
print("Colour: Light Green")
zip_pharm_pop.loc[zip_pharm_pop['Cluster Labels'] == 5]

Colour: Light Green


Unnamed: 0,Cluster Labels,ZIP Code,ZIP Code Name,Latitude,Longitude,County,Population,Pharmacy Count
11,5,32710,Clarcona,28.613674,-81.482614,Orange County,0,53
51,5,32816,Orlando,28.597804,-81.198968,Orange County,0,57
84,5,34742,Kissimmee,28.303664,-81.447525,Osceola County,0,46


In [47]:
print("Colour: Yellow-green")
zip_pharm_pop.loc[zip_pharm_pop['Cluster Labels'] == 6]

Colour: Yellow-green


Unnamed: 0,Cluster Labels,ZIP Code,ZIP Code Name,Latitude,Longitude,County,Population,Pharmacy Count
6,6,32703,Apopka,28.663528,-81.474427,Seminole County,50992,43
20,6,32746,Lake Mary,28.766612,-81.352874,Seminole County,43232,44
26,6,32765,Oviedo,28.644105,-81.218553,Seminole County,65151,48
30,6,32771,Sanford,28.80402,-81.321501,Seminole County,54980,34
31,6,32773,Sanford,28.762291,-81.27536,Seminole County,30955,41
62,6,32828,Orlando,28.546937,-81.181843,Orange County,65665,37
83,6,34741,Kissimmee,28.304302,-81.422474,Osceola County,45920,48
85,6,34743,Kissimmee,28.327803,-81.347517,Osceola County,44335,47
86,6,34744,Kissimmee,28.303873,-81.359308,Osceola County,55210,47
87,6,34746,Kissimmee,28.29937,-81.469229,Osceola County,43249,43


In [48]:
print("Colour: Light Orange")
zip_pharm_pop.loc[zip_pharm_pop['Cluster Labels'] == 7]

Colour: Light Orange


Unnamed: 0,Cluster Labels,ZIP Code,ZIP Code Name,Latitude,Longitude,County,Population,Pharmacy Count
0,7,32102,Astor,29.163127,-81.542822,Lake County,2195,0
17,7,32732,Geneva,28.753065,-81.117279,Seminole County,4809,0
28,7,32767,Paisley,28.99976,-81.50764,Lake County,3273,0
73,7,34705,Astatula,28.705429,-81.728254,Lake County,2649,0
81,7,34739,Kenansville,27.898838,-81.056426,Osceola County,334,0
101,7,34773,Saint Cloud,28.113556,-80.99858,Osceola County,3771,0


By looking at each cluster closely, I can come to the following conclusions. Clusters 0, 3, 4, and 6 contain a relatively high average population with pharmacy counts in different ranges such as 50-70, 2-30, 80-100, and 30-50. Clusters 2 and 5 contain the ZIP codes that have no population information. Cluster 1 contains ZIP codes with populations below 10,000 with pharmacy counts from above 1 until 40. Cluster 7 contains all of the ZIP codes without pharmacies with a population that is not 0. 

The most favourable cluster is clearly cluster 7 due to it having population data and it lacking in pharmacies.

## Results

Cluster 7's characteristsics ultimately make it perfect in terms of reaching the requirements of 0 pharmacies and a non-0 zero population.

In [49]:
most_favourable_cluster  = zip_pharm_pop.loc[zip_pharm_pop['Cluster Labels'] == 7]
most_favourable_cluster.sort_values(by='Population')

Unnamed: 0,Cluster Labels,ZIP Code,ZIP Code Name,Latitude,Longitude,County,Population,Pharmacy Count
81,7,34739,Kenansville,27.898838,-81.056426,Osceola County,334,0
0,7,32102,Astor,29.163127,-81.542822,Lake County,2195,0
73,7,34705,Astatula,28.705429,-81.728254,Lake County,2649,0
28,7,32767,Paisley,28.99976,-81.50764,Lake County,3273,0
101,7,34773,Saint Cloud,28.113556,-80.99858,Osceola County,3771,0
17,7,32732,Geneva,28.753065,-81.117279,Seminole County,4809,0


Thus, it is only natural to show what it looks like on its own, in comparison to the locations of the pharmacies.

In [50]:
most_favourable = folium.Map(location=[orlando_latitude, orlando_longitude], zoom_start=9)
# set color scheme for the clusters
# add markers to the map
markers_colors = []
for lat, lon, zip_code, zip_code_name, county, cluster, count, pop in zip(most_favourable_cluster['Latitude'], most_favourable_cluster['Longitude'],most_favourable_cluster['ZIP Code'],most_favourable_cluster['ZIP Code Name'], most_favourable_cluster['County'],most_favourable_cluster['Cluster Labels'], most_favourable_cluster['Pharmacy Count'], most_favourable_cluster['Population']):
    label = folium.Popup(str(zip_code) + ', ' + str(zip_code_name) + ', ' + str(county) + ',' + ' Cluster ' + str(cluster) + ', Pharmacies: ' + str(int(count)) + ', Population: ' + str(int(pop)), parse_html=True)
    folium.CircleMarker([lat, lon], radius=7, popup=label, color='black', weight=1, fill=True, fill_color='red', fill_opacity=0.9).add_to(most_favourable)


In [51]:
for lat, lng, name in zip(pharmacy_list2['Pharmacy Latitude'], pharmacy_list2['Pharmacy Longitude'], pharmacy_list2['Pharmacy Name']):
    label = '{}, {}, {}'.format(name, lat, lng)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker([lat, lng], radius=2.5, popup=label, color='Green', fill=True, fill_color='Yellow', fill_opacity=0.7, parse_html=False).add_to(most_favourable)      

most_favourable

## Discussion

Although the main objective of this project was to find locations that are the most in need of a pharmacy during this pandemic, a side task was also to help any new pharmacists find a place to conduct their business. That entails finding a location that will contribute to their continued business from a customer base that is preferably as large as possible without any nearby competition. 

The most favourable cluster had a label of 7, but do these ZIP codes in this cluster allow a pharmacy serve the most people per pharmacy? 

Although cluster 7 was deemed the best one due to it being the most straightforward answer, could there have been ZIP codes hiding in other clusters that are just as favourable?

To find out, I divide the population by the number of pharmacies in the area and the number of pharmacies in the area + 1 to calculate the change an extra pharmacy would have on how many people each pharmacy would serve for all ZIP codes.

In [52]:
zip_pharm_pop = zip_pharm_pop.join(
    pd.DataFrame(round(zip_pharm_pop['Population']/zip_pharm_pop['Pharmacy Count'],2)).rename(columns={0:'Current People/Pharmacy'}).replace(np.inf,np.nan)).join(
    pd.DataFrame(round(zip_pharm_pop['Population']/(zip_pharm_pop['Pharmacy Count']+1),2)).rename(columns={0:'after adding a pharmacy'})).sort_values(by='after adding a pharmacy')
zip_pharm_pop.tail(10)

Unnamed: 0,Cluster Labels,ZIP Code,ZIP Code Name,Latitude,Longitude,County,Population,Pharmacy Count,Current People/Pharmacy,after adding a pharmacy
72,3,33898,Lake Wales,27.919382,-81.498099,Osceola County,17107,4,4276.75,3421.4
64,3,32832,Orlando,28.394327,-81.213231,Orange County,24325,6,4054.17,3475.0
89,3,34748,Leesburg,28.787056,-81.881917,Lake County,39230,10,3923.0,3566.36
101,7,34773,Saint Cloud,28.113556,-80.99858,Osceola County,3771,0,,3771.0
32,3,32776,Sorrento,28.806416,-81.538306,Lake County,12416,2,6208.0,4138.67
69,3,33844,Haines City,28.093642,-81.603484,Osceola County,35280,7,5040.0,4410.0
79,3,34736,Groveland,28.573448,-81.868636,Lake County,18213,3,6071.0,4553.25
17,7,32732,Geneva,28.753065,-81.117279,Seminole County,4809,0,,4809.0
19,1,32736,Eustis,28.90372,-81.515285,Lake County,9922,1,9922.0,4961.0
94,3,34759,Kissimmee,28.104017,-81.477719,Osceola County,39850,7,5692.86,4981.25


Now I then find the difference between the current and supposed future people/pharmacy to find how much of a difference the addition of an extra pharmacy would have on each ZIP code. If the 'after adding a pharmacy' shows the estimated customer base per pharmacy after adding a pharmacy, then the marginal difference shows how much that customer base decreases by per pharmacy after adding the pharmacy.

In [53]:
zip_pharm_pop.join(pd.DataFrame(abs(zip_pharm_pop.replace(np.nan,0)['Current People/Pharmacy'] - zip_pharm_pop['after adding a pharmacy'])).rename(columns={0:'marginal difference'})).sort_values(by='marginal difference').tail(10)

Unnamed: 0,Cluster Labels,ZIP Code,ZIP Code Name,Latitude,Longitude,County,Population,Pharmacy Count,Current People/Pharmacy,after adding a pharmacy,marginal difference
72,3,33898,Lake Wales,27.919382,-81.498099,Osceola County,17107,4,4276.75,3421.4,855.35
5,1,32702,Altoona,29.013635,-81.634213,Lake County,2793,1,2793.0,1396.5,1396.5
79,3,34736,Groveland,28.573448,-81.868636,Lake County,18213,3,6071.0,4553.25,1517.75
32,3,32776,Sorrento,28.806416,-81.538306,Lake County,12416,2,6208.0,4138.67,2069.33
0,7,32102,Astor,29.163127,-81.542822,Lake County,2195,0,,2195.0,2195.0
73,7,34705,Astatula,28.705429,-81.728254,Lake County,2649,0,,2649.0,2649.0
28,7,32767,Paisley,28.99976,-81.50764,Lake County,3273,0,,3273.0,3273.0
101,7,34773,Saint Cloud,28.113556,-80.99858,Osceola County,3771,0,,3771.0,3771.0
17,7,32732,Geneva,28.753065,-81.117279,Seminole County,4809,0,,4809.0,4809.0
19,1,32736,Eustis,28.90372,-81.515285,Lake County,9922,1,9922.0,4961.0,4961.0


Although the ZIP codes in cluster 7 are the ones in technically most need of a pharmacy, it can be seen in the marginal difference in people per pharmacy that Eustis in Lake County could also be a good candidate for a new pharmacy location.

In [54]:
label = folium.Popup(str(zip_pharm_pop['ZIP Code'][19]) + ', ' + str(zip_pharm_pop['ZIP Code Name'][19]) + ', ' + str(zip_pharm_pop['County'][19]) + ',' + ' Cluster ' + str(zip_pharm_pop['Cluster Labels'][19]) + ', Pharmacies: ' + str(int(zip_pharm_pop['Pharmacy Count'][19])) + ', Population: ' + str(int(zip_pharm_pop['Population'][19])), parse_html=True)
folium.CircleMarker([zip_pharm_pop['Latitude'][19], zip_pharm_pop['Longitude'][19]], radius=7, popup=label, color='black', weight=1, fill=True, fill_color='orange', fill_opacity=0.9).add_to(most_favourable)
most_favourable

## Conclusion

The purpose of this project was to find ideal locations in Greater Orlando for pharmacies to help lessen the effect of the COVID-19 pandemic and to help new pharmacists conduct their business. An ideal location was defined as having no pharmacies while having a high population. By dividing Greater Orlando into counties and further dividing them into ZIP codes, I was given locations to analyze closely. After gathering population data for each ZIP code from a website and using Foursquare to gather pharmacy location data, I was able to create clusters based on the population and number of pharmacies in the area. Clustering was necessary to classify each ZIP code according to its viability for a new pharmacy, although another potential and similarly favourable location not part of the cluster was found through some further data analysis.

Final decision will be made by stakeholders based on characteristics of each potential location, including and not limited to: distance from the closest pharmacy or hospital, distance from other potential locations, distance from delivery services and service fees, cost of construction and real estate, etc.