# Problem Description:
### If someone is looking to open a restaurant in DENVER, where would you recommend that they open it?

# Background Discussion:
Since we do not have data regarding the relative financial success of various restaurants in the area, we will have to make an educated guess on what we should look for when searching for a location to open a new restaurant. Some reasonable factors to consider might be:
<ul><li>Total non-restaurant venues nearby (positive, indicates a busy area)</li>
    <li>Total other restaurants nearby (negative, potential for lost customers)</li>
    <li>Median income of neighborhood (positive, disposable income leads to eating out)</li>
    <li>Proximity to public transportation (positive, ease of access)</li>
    </ul>

We also need to choose the scale at which to analyze this problem. Due to data availability, we will use the zip code (postal code) as our unit of analysis. To keep the model simple, each postal code will be represented by a single centroid point that supplies its lat/long coordinates.

# Who Would Be Interested In This Project?:
The model might be useful to anyone interested in opening a restaurant in the Denver area.

# Data Description:
The data below will be used for exploratory visualization as well as features for supervised machine learning. The raw numbers will be normalized for model input and weighted against the relative population of each zipcode.

<ul>
    <li>List of zipcodes in Denver Metro Area <a href="https://www.zipcodestogo.com/Colorado/">source</a></li>
    <li>Venues in each zipcode (source: Foursquare API). Will be used for:
    <ul>
        <li>Total restaurants in radius</li>
        <li>Type of each</li>
    </ul></li>
    <li>Median income data, flattened by zipcode <a href="https://www.irs.gov/pub/irs-soi/16zp06co.xls">source</a></li>
    <li>List of public transportation stations <a href="https://opendata.arcgis.com/datasets/17749050721d427399ab4e038028929d_6.csv">source</a></li>
    <li>Population data <a href="https://s3.amazonaws.com/SplitwiseBlogJB/2010+Census+Population+By+Zipcode+(ZCTA).csv">source</a></li> </ul>

# Methodology:

#### Imports

In [108]:
import folium, re, geocoder, requests, json, uszipcode, sklearn, branca
import pandas as pd, numpy as np, matplotlib as mpl
import matplotlib.cm as cm, matplotlib.colors as colors
from pandas.io.json import json_normalize
from geopy.geocoders import Nominatim
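from sklearn.cluster import KMeans
import sklearn.preprocessing  # loaded explicitly so sklearn.preprocessing.MinMaxScaler resolves later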

#### Foursquare credentials

In [None]:
with open('/Users/kutch/Desktop/IBM/foursquarecreds.txt') as f:
    fsq = f.read()
client_id = re.search("(?<=CLIENT_ID)[ =]+'([A-Z0-9]+)", fsq).group(1)
client_secret = re.search("(?<=CLIENT_SECRET)[ =]+'([A-Z0-9]+)", fsq).group(1)
#not sure if coursera wants me to update this or not
VERSION = '20190501' # Foursquare API version
#from foursquare, function that extracts the category of the venue
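def get_category_type(row):
    # the category list sits under 'categories' or 'venue.categories',
    # depending on how the response was flattened
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
    if len(categories_list) == 0:
        return None
    return categories_list[0]['name']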
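
The regular expressions in the cell above imply a credentials file with lines such as `CLIENT_ID = '...'`. The following is a minimal sketch of that parsing on an in-memory string. The sample text, the placeholder values, and the `parse_credentials` helper are all hypothetical, not part of the original notebook.

```python
import re

# Hypothetical file contents with placeholder values (not real keys).
SAMPLE = "CLIENT_ID = 'ABC123'\nCLIENT_SECRET = 'XYZ789'\n"

def parse_credentials(text):
    """Return (client_id, client_secret) parsed from text, raising if either is missing."""
    found = {}
    for key in ("CLIENT_ID", "CLIENT_SECRET"):
        # Same pattern as the notebook: key name, then spaces/equals, then a quoted token.
        match = re.search(rf"(?<={key})[ =]+'([A-Z0-9]+)", text)
        if match is None:
            raise ValueError(f"{key} not found in credentials text")
        found[key] = match.group(1)
    return found["CLIENT_ID"], found["CLIENT_SECRET"]

client_id_demo, client_secret_demo = parse_credentials(SAMPLE)
```

Raising an explicit error when a key is missing avoids the less readable `AttributeError` that `re.search(...).group(1)` produces when the pattern does not match.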

#### Get all zipcodes in city of Denver

In [None]:
zipsearch = uszipcode.SearchEngine(simple_zipcode=True)
all_zips = pd.read_html('https://www.zipcodestogo.com/Colorado/', 
             match='Zip Codes', 
             skiprows=range(3))[0]
all_zips = all_zips.drop(range(2,len(all_zips.columns)),axis=1)
all_zips.columns = ['Postalcode','City']
all_zips = all_zips.set_index('Postalcode')
denver_zips = all_zips[all_zips['City']=='Denver']

#### Append lat/long to zipcodes

In [None]:
#look up each zipcode once and keep its centroid coordinates
ll_data = []
for zipcode in denver_zips.index:
    result = zipsearch.by_zipcode(zipcode)
    ll_data.append([zipcode, result.lat, result.lng])
ll_df = pd.DataFrame(data=ll_data, columns=['Postalcode','Latitude','Longitude']).dropna().set_index('Postalcode')
denver_zips = denver_zips.merge(ll_df,left_index=True,right_index=True)

#### Get population data
Important note: Census data cannot be mapped exactly to zip code (<a href="https://www.quora.com/Where-can-I-find-U-S-Census-data-with-population-per-ZIP-code-Other-details-such-as-age-gender-breakdown-would-be-helpful">see more</a>)

In [None]:
population = pd.read_csv("https://s3.amazonaws.com/SplitwiseBlogJB/2010+Census+Population+By+Zipcode+(ZCTA).csv")
population = population.rename(mapper={population.columns[0]:'Postalcode', 
                                       population.columns[1]:'Population'}, axis=1)
population = population.set_index('Postalcode')
population_denver = population.merge(denver_zips, left_index=True, right_index=True).drop(denver_zips.columns, axis=1)

#### Get income data

In [None]:
#no index column is set here: the zipcode stays a regular column and is named below
income = pd.read_excel('https://www.irs.gov/pub/irs-soi/16zp06co.xls',
                   header=None,
                   skiprows=range(6),
                   usecols=range(3)).dropna()
income.columns = ['Postalcode','Income','NumPeople']
income = income[income['Income']!='Total']
#drop rows whose NumPeople value starts with '.' before converting to float
income = income[~income['NumPeople'].astype(str).str.match(r'\.')]
income['NumPeople'] = income['NumPeople'].astype(float)
income = income.set_index('Postalcode')
#most common income bracket per zipcode (a modal bracket, used here as a stand-in for median income)
tgb = income.drop(['Income'],axis=1).groupby(by=['Postalcode']).max()
income = tgb.merge(income, on=['Postalcode'], suffixes=['_x',''])
income = income[income['NumPeople_x']==income['NumPeople']].drop(['NumPeople_x','NumPeople'], axis=1)
income_denver = income.merge(denver_zips, left_index=True, right_index=True).drop(denver_zips.columns, axis=1)
income_denver = income_denver.rename(mapper={'Income':'MedianIncomeBracket'},axis=1)
income_denver['MedianIncomeBracket'] = income_denver['MedianIncomeBracket'].astype('str')
#same replacement string as before, written with explicit escapes
income_denver = income_denver.replace({'under':'\\ through\\ '},regex=True)
income_denver = income_denver.groupby(by=income_denver.index).min()
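
The income cell above keeps, for each zipcode, the bracket reported by the largest number of people. A compact way to express the same idea is sketched below on invented rows. The zipcodes and counts are made up for illustration, and `idxmax` is swapped in for the notebook's merge-based approach.

```python
import pandas as pd

# Invented rows in the same shape as the cleaned table: one row per zipcode and bracket.
toy = pd.DataFrame({
    'Postalcode': [11111, 11111, 11111, 22222, 22222],
    'Income': ['$1 under $25,000', '$25,000 under $50,000', '$50,000 under $75,000',
               '$1 under $25,000', '$25,000 under $50,000'],
    'NumPeople': [100.0, 250.0, 90.0, 400.0, 120.0],
})

# idxmax returns the row label of the largest NumPeople within each zipcode.
top_rows = toy.loc[toy.groupby('Postalcode')['NumPeople'].idxmax()]
modal_bracket = top_rows.set_index('Postalcode')['Income']
```

One difference worth noting: `idxmax` keeps the first row when two brackets tie, whereas the notebook keeps all tied rows and then resolves them with `min()`.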

#### Get public transportation data

In [None]:
lightrail = pd.read_csv('https://opendata.arcgis.com/datasets/17749050721d427399ab4e038028929d_6.csv')
lightrail = lightrail.rename(mapper={'ZIPCODE':'Postalcode', 'PID':'NumLightrailStations'},axis=1)
lightrail = pd.DataFrame(lightrail.groupby(by=['Postalcode']).count()['NumLightrailStations'])

#### Get venue data

In [None]:
#for a given list of zipcodes, fetch venues near each zipcode centroid
def get_nearby_venues(zipcodes, latitudes, longitudes, radius=500, limit=100):
    venues_list = []
    for name, lat, lng in zip(zipcodes, latitudes, longitudes):
        # create the API request URL
        url = f'https://api.foursquare.com/v2/venues/explore?&client_id={client_id}&client_secret={client_secret}&v={VERSION}&ll={lat},{lng}&radius={radius}&limit={limit}'
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        # keep only relevant information for each nearby venue
        for v in results:
            categories = v['venue']['categories']
            venues_list.append((
                name,
                lat,
                lng,
                v['venue']['name'],
                v['venue']['location']['lat'],
                v['venue']['location']['lng'],
                # a venue with no category would otherwise raise an IndexError
                categories[0]['name'] if categories else None))
    nearby_venues = pd.DataFrame(venues_list, columns=[
        'Zipcode',
        'ZipcodeLatitude',
        'ZipcodeLongitude',
        'Venue',
        'VenueLatitude',
        'VenueLongitude',
        'VenueCategory'])
    return nearby_venues

#get venue info for all denver zipcodes
venues_denver = get_nearby_venues(zipcodes=pd.Series(denver_zips.index), 
                                  latitudes=denver_zips['Latitude'], 
                                  longitudes=denver_zips['Longitude'])
venues_denver = venues_denver.rename(mapper={'Zipcode':'Postalcode'},axis=1)
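
To make the flattening step in `get_nearby_venues` easier to follow without calling the API, here is a sketch that runs the same key lookups on a hand-written payload. The payload is invented and mirrors only the keys the function reads (`response`, `groups`, `items`, `venue`). `flatten_items` is a hypothetical helper.

```python
# Invented payload that mirrors only the keys read by get_nearby_venues.
sample = {"response": {"groups": [{"items": [
    {"venue": {"name": "Cafe A",
               "location": {"lat": 39.74, "lng": -104.99},
               "categories": [{"name": "Coffee Shop"}]}},
    {"venue": {"name": "Mystery Spot",
               "location": {"lat": 39.75, "lng": -104.98},
               "categories": []}},
]}]}}

def flatten_items(zipcode, payload):
    """Turn one explore-style payload into a list of flat row dictionaries."""
    rows = []
    for item in payload["response"]["groups"][0]["items"]:
        venue = item["venue"]
        categories = venue.get("categories") or []
        rows.append({
            "Postalcode": zipcode,
            "Venue": venue["name"],
            "VenueLatitude": venue["location"]["lat"],
            "VenueLongitude": venue["location"]["lng"],
            # Venues without a category get None instead of raising IndexError.
            "VenueCategory": categories[0]["name"] if categories else None,
        })
    return rows

rows = flatten_items("80202", sample)
```

A list of dictionaries like `rows` can be passed straight to `pd.DataFrame` to obtain the tabular form used in the rest of the notebook.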

In [None]:
restaurant_categories = pd.read_csv("https://raw.githubusercontent.com/davidschneider04/Coursera_Capstone/master/RestaurantCategories.csv"
                                   ,encoding = 'latin-1')
restaurants_denver = venues_denver.merge(restaurant_categories, on='VenueCategory')
restaurants_denver = restaurants_denver.drop(['ZipcodeLatitude','ZipcodeLongitude'],axis=1)
restaurants_denver = restaurants_denver[['Postalcode','VenueCategory','Venue','VenueLatitude','VenueLongitude']]
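
The merge above acts as a filter: pandas merges are inner joins by default, so only venues whose category appears in the restaurant lookup table survive. A small sketch with invented venues illustrates this.

```python
import pandas as pd

# Invented venues; only the column names follow the notebook.
venues = pd.DataFrame({
    'Postalcode': [80206, 80206, 80211],
    'Venue': ['Slice House', 'City Park', 'Roll Bar'],
    'VenueCategory': ['Pizza Place', 'Park', 'Sushi Restaurant'],
})
# Hypothetical stand-in for RestaurantCategories.csv.
restaurant_lookup = pd.DataFrame({'VenueCategory': ['Pizza Place', 'Sushi Restaurant']})

# Inner join (the default) drops 'City Park' because 'Park' is not in the lookup.
restaurants = venues.merge(restaurant_lookup.drop_duplicates(), on='VenueCategory')
```

Calling `drop_duplicates()` on the lookup is a precaution: a category listed twice would otherwise duplicate every matching venue row.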

### First, let's get a general sense of the distribution of restaurants in the area using Folium.

#### Create a map centered on Denver

In [None]:
def create_map_denver():
    location = Nominatim(user_agent="den_explorer").geocode("Denver, CO")
    latitude, longitude = location.latitude, location.longitude
    mapobj = folium.Map(location=[latitude, longitude], zoom_start=12)
    return mapobj
map_denver = create_map_denver()

#### Frame the zipcodes we are interested in. Information originally sourced from:
https://opendata.arcgis.com/datasets/6b6091f299204e4c9c406a624baf43e6_10.geojson

In [None]:
with open('/Users/kutch/Desktop/IBM/Colorado_ZIP_Code_Tabulation_Areas_ZCTA.geojson', 'r') as gjson:
    data = json.load(gjson)
tmp = data
dzips = list(denver_zips.index.astype('str').unique())
map_boundaries = {'type': 'FeatureCollection', 'features': []}
for item in tmp['features']:
    item['properties']['name'] = str(item['properties']['ZCTA5CE10'])
    zipcode = item['properties']['ZCTA5CE10']
    if zipcode in dzips:
        map_boundaries['features'].append(item)
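
The filtering step above can be tried without the full GeoJSON file. The sketch below applies the same idea to a two-feature collection. The features are invented (geometry omitted), and `filter_features` is a hypothetical helper that copies each kept feature rather than editing it in place.

```python
# Invented two-feature collection; real features also carry polygon geometry.
collection = {"type": "FeatureCollection", "features": [
    {"type": "Feature", "properties": {"ZCTA5CE10": "80206"}, "geometry": None},
    {"type": "Feature", "properties": {"ZCTA5CE10": "81001"}, "geometry": None},
]}
wanted = {"80206", "80211"}

def filter_features(collection, wanted):
    """Keep features whose ZCTA5CE10 is in `wanted`, adding a 'name' property to each."""
    kept = []
    for feature in collection["features"]:
        zipcode = str(feature["properties"]["ZCTA5CE10"])
        if zipcode in wanted:
            # Build new dictionaries so the source collection is left unchanged.
            properties = dict(feature["properties"], name=zipcode)
            kept.append(dict(feature, properties=properties))
    return {"type": "FeatureCollection", "features": kept}

subset = filter_features(collection, wanted)
```

Using a set for `wanted` keeps each membership check fast, which matters little here but helps when the same pattern is reused on a statewide file.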

In [None]:
def style_function(feature):
    return {
        'fillOpacity': 0.5,
        'weight': 1}

for i, feature in enumerate(map_boundaries['features'], start=1):
    geojson = folium.GeoJson(
        feature,
        name=f'mapfeature{i}',
        style_function=style_function
    ).add_to(map_denver)
    #clicking a polygon shows its zipcode
    folium.Popup(feature['properties']['name']).add_to(geojson)

#### Add restaurant markers to map

In [None]:
for lat, lng, label in zip(restaurants_denver['VenueLatitude'], 
                           restaurants_denver['VenueLongitude'], 
                           restaurants_denver['Venue']):
    label = folium.Popup(str(label), parse_html=True)
    restaurant_marker = folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7)
    restaurant_marker.add_to(map_denver)

#### Render the map

In [None]:
#click a circle marker to see the name of the restaurant,
##a general area to see the zipcode
map_denver

### Obviously, this is an incredibly complex problem, and we could refine a model forever if the stakes were higher and more resources were available. For now, we will create a heuristic to simplify our process, our model, and its interpretation.
In this hypothetical situation, let's say our restaurant is relatively upscale. From personal experience, I would say an oversimplified picture of a successful environment for this theoretical venue would be a location where potential customers have enough money to eat out often at our restaurant, and where there are few enough restaurants nearby that we will not be constantly competing for customers. So, we will search for zipcodes that satisfy the following criteria:
<ul><li>Low restaurant:person ratio</li>
    <li>High median income</li></ul>

#### restaurant:person ratio

In [None]:
zipvenues_denver = venues_denver.set_index('Postalcode').groupby(by=['Postalcode']).count()['VenueCategory']
zipvenues_denver = pd.DataFrame(zipvenues_denver).rename(mapper={'VenueCategory':'TotalVenues'}, axis=1)
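#restaurant density
##note: TotalVenues counts every venue Foursquare returned, not only restaurants;
##build zipvenues_denver from restaurants_denver to count restaurants alone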
rd_denver = zipvenues_denver.merge(population_denver, 
                                   left_index=True, right_index=True)
rd_denver['restaurant_density'] = rd_denver['TotalVenues'] / rd_denver['Population']
rd_denver = rd_denver.replace([np.inf], 0)
rd_denver = rd_denver.drop([col for col in rd_denver if col != 'restaurant_density'],
                           axis=1)
##remove outliers before scaling
outliers = ['80202']
tmp = rd_denver.loc[[i for i in rd_denver.index if str(i) not in outliers]].copy()
##scale to the range [0, 1]
mms = sklearn.preprocessing.MinMaxScaler()
tmp['restaurant_density'] = mms.fit_transform(tmp[['restaurant_density']]).ravel()
rd_denver = rd_denver.merge(tmp, how='left', 
                            left_index=True, right_index=True,
                            suffixes=('_x',''))
rd_denver = rd_denver.drop(columns=[col for col in rd_denver.columns if col.endswith('_x')])
##add outliers back in at the maximum scaled value
rd_denver = rd_denver.fillna(1)
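
The scaling logic above can be summarized as: hold the outlier out, min-max scale the rest with (x - min) / (max - min), then give the outlier the top value of 1. The sketch below shows this in plain pandas on invented densities. The labels and numbers are made up, and the formula is written out by hand in place of `MinMaxScaler`.

```python
import pandas as pd

# Invented densities (venues per 1,000 residents); 'OUT' plays the role of the outlier.
density = pd.Series({'A': 2.0, 'B': 4.0, 'C': 6.0, 'OUT': 90.0})
outliers = ['OUT']

# Scale only the inliers so the outlier does not compress everything else toward 0.
inliers = density.drop(outliers)
scaled = (inliers - inliers.min()) / (inliers.max() - inliers.min())

# Restore the original index; the held-out outlier becomes NaN and is set to 1.
scaled = scaled.reindex(density.index).fillna(1.0)
```

Had the outlier been included in the scaling, `A`, `B`, and `C` would all have landed below 0.05, leaving almost no contrast between them on the map.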

#### Create a combined dataset of zipcode attributes

In [87]:
#datasets: 
##denver_zips
##income_denver
##rd_denver
denver_data = denver_zips
datasets = [income_denver, rd_denver]
for dataset in datasets:
    denver_data = denver_data.merge(dataset, left_index=True, right_index=True)

### Let's refresh our map with this new data
#### Shade zipcode polygons according to inverse restaurant density

In [104]:
map_denver = create_map_denver()
denver_data['restaurant_density_color'] = denver_data['restaurant_density'].astype('str')
denver_data['restaurant_density_color'] = denver_data['restaurant_density_color'].apply(mpl.colors.to_hex)
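
The two lines above rely on matplotlib reading a string that holds a float between 0 and 1 as a gray level, with `'0'` as black and `'1'` as white, so zipcodes with a low scaled density are drawn darker. The sketch below reproduces that mapping without matplotlib. `gray_hex` is a hypothetical helper, and its rounding may differ from matplotlib's by one step for some inputs.

```python
def gray_hex(level):
    """Map a gray level in [0, 1] to a hex color (0 = black, 1 = white)."""
    if not 0.0 <= level <= 1.0:
        raise ValueError("level must be between 0 and 1")
    channel = round(level * 255)
    # The same channel value is used for red, green, and blue.
    return '#{0:02x}{0:02x}{0:02x}'.format(channel)

darkest = gray_hex(0.0)
lightest = gray_hex(1.0)
```

If lighter shading for low density is preferred, passing `1 - level` inverts the scale.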

In [105]:
def style_function(feature):
    global denver_data
    color = denver_data.loc[int(feature['properties']['name'])]['restaurant_density_color']
    return {
        'fillColor': color,
        'color': color,
        'fillOpacity': 0.5,
        'weight': 1}

#### Only plot polygons for zipcodes above the lowest income bracket

In [106]:
#zipcodes whose income bracket is anything other than the lowest one
eligible_zips = denver_data[~denver_data['MedianIncomeBracket'].isin(['$1 \\ through\\  $25,000'])].index
i=1
for feature in map_boundaries['features']:
    zipcodeval = feature['properties']['name'] 
    if int(zipcodeval) in eligible_zips:
        geojson = folium.GeoJson(
            feature,
            name=f'mapfeature{i}',
            style_function=style_function
        ).add_to(map_denver)
        folium.Popup(zipcodeval).add_to(geojson)
        i+=1

In [107]:
map_denver

#### It seems like the zipcode within the city proper that stands out the most is 80206. Let's see if we can confirm or deny our suspicions with a supervised machine learning model.

### Similarity Modeling
Suppose we do some more research, and we learn that many restaurants in a certain portion of town ("Highlands") are thriving (<a href="https://www.westword.com/restaurants/lohi-now-home-to-over-75-restaurants-and-brs-11258151">source</a>). But, we then find this research to be largely outdated and the area currently overpriced. Can we find a suitable substitute? Absent other data, this neighborhood can be our estimate of a good area to open a restaurant in. We can then use similarity modeling to find other areas in Denver that share the same features which make it favorable for restaurant development, but at a hopefully lower price.

#### "LoHi" is a neighborhood, but we've created our dataset at a zipcode level, so we need to standardize.
Using <a href="https://www.denvergov.org/maps/map/neighborhoods">this map</a>, we can see that the "Highlands" neighborhood is a subset of the 80211 zipcode. This means we will take the 80211 row of our "denver_data" DataFrame to represent a good spot for a restaurant going into our model.

#### KNN Modeling
Unlike our previous lab, we are not doing exploratory clustering with this model. Rather, we know the cluster (or in this case, data point) that interests us, and we want to know which other clusters are most similar across a combination of dimensions. So, we do not use a K-means algorithm for modeling but rather the supervised KNN model.
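
The notebook does not include the modeling code, so the following is only a sketch of the idea under stated assumptions. It ranks zipcodes by Euclidean distance to a reference row, which is a plain nearest-neighbor search swapped in as the simplest stand-in for the KNN model described above. The labels, feature values, and the `rank_by_similarity` helper are all invented. `REF` stands in for the 80211 row, and the income bracket is assumed to be encoded as an ordinal value scaled to [0, 1].

```python
import numpy as np

zipcodes = ['REF', 'Z1', 'Z2', 'Z3']
# Columns: scaled restaurant density, ordinal income bracket scaled to [0, 1] (invented).
features = np.array([
    [0.40, 0.75],
    [0.45, 0.75],
    [0.90, 0.25],
    [0.10, 1.00],
])

def rank_by_similarity(labels, matrix, reference_label):
    """Return (label, distance) pairs sorted from most to least similar to the reference."""
    reference = matrix[labels.index(reference_label)]
    distances = np.linalg.norm(matrix - reference, axis=1)
    order = np.argsort(distances)
    return [(labels[i], float(distances[i])) for i in order
            if labels[i] != reference_label]

ranking = rank_by_similarity(zipcodes, features, 'REF')
```

Because distance is sensitive to units, every feature should be on a comparable scale before ranking. scikit-learn's `NearestNeighbors` offers the same search with more distance metrics if the dataset grows.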

# Results

Results go here.

# Discussion

Discussion goes here.

# Conclusion

Conclusion goes here.