## For this week, you will be required to submit the following:

**A description of the problem and a discussion of the background. (15 marks)**

Tampa, FL has a reputation as a city where most residents live in the surrounding suburbs and commute into the city for work. Because of this, the population isn't as dense as in other cities, which can make it difficult for restaurant businesses to pick locations for new venues. I'd like to explore whether the existing venues in a single suburb can be modeled based on their local area demographics to determine the best areas for a new venue to achieve a high number of "likes" or a high rating on Foursquare.

Being able to predict this would help hedge against the usual risks that any small business faces when starting up. Sustaining a business with effective management is also important, but to even reach that point, a prospective founder needs to choose a location and actually open their venue. Mobile businesses such as food trucks are an alternative to committing to a specific location, but where that isn't a viable option, it is important to be well-informed about the potential customer base near any candidate location; otherwise, a struggling business may have to choose between relocating and closing shop altogether.

Most restaurants tend to be more popular with certain demographics than others. Market research may be required to identify what those target demographics actually are, and even if a businessperson thinks they know best, it wouldn't take much time or effort to apply some data science to publicly available information about local demographics and consumer sentiment toward comparable venues, turning it into actionable insight on areas that are prime for a new venue.

**A description of the data and how it will be used to solve the problem. (15 marks)**

##### Data

+ Foursquare:
    + Venue locations (lat/long)
    + Venue category?
    + Venue "likes" from Foursquare are only available in Venue Details, at 1 result per call. To avoid burning through the allotted calls by repeatedly requesting the same data, I'll probably keep a finalized JSON file once I've figured out which data I want, and load it into Python from storage.

+ Census.gov
    + Demographic breakdown: CSV files of "SEX BY AGE" (B01001 and B01001A-I, ACS2017) per Census Tract for all of Florida (the Tampa Bay area includes a few counties so I'm not narrowing this data down too much yet) 
        + All races/ethnicities https://factfinder.census.gov/bkmk/table/1.0/en/ACS/17_5YR/B01001/0400000US12.14000
        + White alone https://factfinder.census.gov/bkmk/table/1.0/en/ACS/17_5YR/B01001A/0400000US12.14000
        + Black alone https://factfinder.census.gov/bkmk/table/1.0/en/ACS/17_5YR/B01001B/0400000US12.14000
        + Am. Indian alone https://factfinder.census.gov/bkmk/table/1.0/en/ACS/17_5YR/B01001C/0400000US12.14000
        + Asian alone https://factfinder.census.gov/bkmk/table/1.0/en/ACS/17_5YR/B01001D/0400000US12.14000
        + Hawaiian alone https://factfinder.census.gov/bkmk/table/1.0/en/ACS/17_5YR/B01001E/0400000US12.14000
        + Other alone https://factfinder.census.gov/bkmk/table/1.0/en/ACS/17_5YR/B01001F/0400000US12.14000
        + 2+ races https://factfinder.census.gov/bkmk/table/1.0/en/ACS/17_5YR/B01001G/0400000US12.14000
        + White alone, non-Hispanic https://factfinder.census.gov/bkmk/table/1.0/en/ACS/17_5YR/B01001H/0400000US12.14000
        + Hispanic https://factfinder.census.gov/bkmk/table/1.0/en/ACS/17_5YR/B01001I/0400000US12.14000
    
    + Shapefiles for census tracts based on the same year as the demographic data (2017) https://www.census.gov/cgi-bin/geo/shapefiles/index.php?year=2017&layergroup=Census+Tracts
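
The caching idea for Venue Details can be sketched as follows. This is a minimal sketch rather than the notebook's actual code: `load_or_fetch`, `cache_dir`, and the `fetch` callable are hypothetical names, and `fetch` stands in for whatever function performs the real API call.

```python
import json
from pathlib import Path


def load_or_fetch(venue_id, cache_dir, fetch):
    """Return venue details from a local JSON file, calling fetch() only on a cache miss.

    `fetch` is a hypothetical callable that takes a venue ID and returns the
    parsed JSON response (a dict) from the Venue Details endpoint.
    """
    cache_path = Path(cache_dir) / f"{venue_id}.json"
    if cache_path.exists():
        # Cache hit: no API call is spent.
        with open(cache_path) as f:
            return json.load(f)
    # Cache miss: spend one API call, then store the result for later runs.
    details = fetch(venue_id)
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    with open(cache_path, "w") as f:
        json.dump(details, f)
    return details
```

Passing the fetcher in as an argument keeps the caching logic independent of the API client, so it can be exercised with a stub before any real calls are made.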


##### Plan

1. Get lat/long for center of Brandon, FL (suburb of Tampa)
2. Get venues within 7500 meter radius of center of Brandon, FL (town boundaries identified earlier as roughly 15km x 15km)
    + Filter by category to restaurants (section = "food")
    + Take list of venueIDs and iterate thru to call for venue details of each, specifically seeking number of "likes" and the rating
    + Store this data to avoid repeating this call (it'll eat thru the allotment quickly)
3. Import shapefiles and get lat/long for center of each census tract
    + Find distance of each to center of Brandon, FL; filter to census tracts within 15 mile radius
4. Import census tract demographic data from CSVs
    + Import per-race CSVs into individual dataframes; from the 4th column on, only every other column is needed (estimates only, no need for margin of error; the column headers are confusing because the files include an extra header row, which misaligns them)
    + Append them into a single master dataframe with an additional column to identify which racial dataframe the rows came from
    + Pivot to reduce data to 1 row per census tract (4247 rows for FL); with age/sex/race and age/sex subpopulation columns (23x2x9 + 23x2x1 = 460 columns)
    + Add additional subpopulation columns aggregating by age/race, age, and race (23x10 + 23 + 2 = 255 more columns)
    + Filter to relevant census tracts
5. Calculate the distance from each venue to each census tract, then calculate the proximity of each demographic subpopulation to the venue ("how far away from the venue is this subpopulation?"): a weighted average of the tract distances, where each tract's distance is multiplied by that tract's share of the subpopulation across all tracts, and the sum is divided by the number of census tracts
6. Dataframe to be analyzed: row per venue; columns for category, lat, long, likes, rating, and proximity of each subpopulation
7. Use machine learning algorithms such as regression and decision trees to develop a model for predicting the number of likes and the rating of a venue based on category and subpopulation proximities
8. Identify "untapped markets" of census tracts where a restaurant could do well by mapping census tracts w/ their predicted number of likes and rating, along with the number of existing venues in that census tract
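
Steps 3 and 5 both rely on distances between lat/long points, and step 5 combines them into a per-subpopulation proximity. The sketch below shows one possible reading of that calculation. It uses the haversine great-circle formula, which is an assumption since the plan does not name a distance formula. `haversine_km`, `subpop_proximity`, and the tract values in the usage lines are illustrative only.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/long points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def subpop_proximity(venue, tracts):
    """Proximity of one subpopulation to a venue, following the plan's description.

    venue  : (lat, lng)
    tracts : list of (lat, lng, subpop_count) tuples, one per census tract

    Each tract's distance is weighted by the tract's share of the
    subpopulation across all tracts; the sum is divided by the tract count.
    """
    total = sum(count for _, _, count in tracts)
    if total == 0:
        return float("nan")  # subpopulation absent from every tract
    weighted = sum(
        haversine_km(venue[0], venue[1], lat, lng) * (count / total)
        for lat, lng, count in tracts
    )
    return weighted / len(tracts)


# Usage with made-up tract centers and counts (not real census data)
venue = (27.937801, -82.2859247)  # geocoded center of Brandon, FL
tracts = [(27.94, -82.29, 120), (27.90, -82.25, 30)]
print(subpop_proximity(venue, tracts))
```

A lower value means the subpopulation is concentrated in tracts closer to the venue, so the feature is comparable across venues only when the same set of tracts is used for each.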

## For the second week, the final deliverables of the project will be:

**A link to your Notebook on your Github repository, showing your code. (15 marks)**

**A full report consisting of all of the following components (15 marks):**
+ Introduction where you discuss the business problem and who would be interested in this project.
+ Data where you describe the data that will be used to solve the problem and the source of the data.
+ Methodology section which represents the main component of the report where you discuss and describe any exploratory data analysis that you did, any inferential statistical testing that you performed, and which machine learning techniques were used and why.
+ Results section where you discuss the results.
+ Discussion section where you discuss any observations you noted and any recommendations you can make based on the results.
+ Conclusion section where you conclude the report.

**Your choice of a presentation or blogpost. (10 marks)**

In [120]:
import json # library to handle JSON files
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import urllib.request # urlretrieve is used below to download venue details

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

import os

!conda install -c conda-forge gitpython --yes
from git import Repo

print('Libraries imported.')

Collecting package metadata (repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /home/b/anaconda3

  added / updated specs:
    - gitpython


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    certifi-2019.3.9           |           py37_0         149 KB  conda-forge
    conda-4.7.5                |           py37_0         3.0 MB  conda-forge
    gitdb2-2.0.5               |             py_0          46 KB  conda-forge
    gitpython-2.1.11           |             py_0         335 KB  conda-forge
    smmap2-2.0.5               |             py_0          21 KB  conda-forge
    ------------------------------------------------------------
                                           Total:         3.5 MB

The following NEW packages will be INSTALLED:

  gitdb2             conda-forge/noarch::gitdb2-2.0.5-py_0
  gitpython          conda-forge/noarch::gitpy

In [118]:
frsq_file_dir = 'https://raw.githubusercontent.com/zenterfield/Coursera_Capstone/master/Foursquare_JSONs/'
local_frsq_dir = '/home/b/Projects/Coursera_Capstone/Foursquare_JSONs/'

In [104]:
# @hidden_cell

# Assumption: credentials are read from environment variables instead of being
# hardcoded in the notebook; the variable names below are illustrative.
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # your Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

#print('Your credentials:')
#print('CLIENT_ID: ' + CLIENT_ID)
#print('CLIENT_SECRET:' + CLIENT_SECRET)

In [105]:
address = 'Brandon, FL'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Brandon, FL are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Brandon, FL are 27.937801, -82.2859247.


In [114]:
# function that extracts the category of the venue
def get_category_type(row):
    # rows from the explore endpoint nest the field under 'venue.categories'
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']
    
def getNearbyVenues(names, latitudes, longitudes, venue_section = 'food', radius=7500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        #print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}&section={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT,
            venue_section
            )
            
        # make the GET request (if downloading list json directly)
        #results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # make the GET request (if using pre-downloaded list json)
        venues_url = frsq_file_dir + 'brandonFL_7.5k_food_venues_' + '20190712T2145' + '.json?raw=true'
        results = requests.get(venues_url).json()["response"]['groups'][0]['items']
        
        venues_list.append([(
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name'],
            v['venue']['id'])
            for v in results]
            )

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Venue', 
                    'Venue_Latitude', 
                    'Venue_Longitude', 
                    'Venue_Category',
                    'Venue_ID']
    
    return nearby_venues

In [115]:
LIMIT = 500

nearby_venues = getNearbyVenues(names=pd.Series(data="Brandon, FL"),
                                latitudes=pd.Series(data=latitude),
                                longitudes=pd.Series(data=longitude))

print('Done!')

Done!


In [99]:
print(nearby_venues.shape)
nearby_venues.head()

(100, 5)


|   | Venue | Venue_Latitude | Venue_Longitude | Venue_Category | Venue_ID |
|---|---|---|---|---|---|
| 0 | Moreno Bakery | 27.936403 | -82.295102 | Bakery | 4c051afef423a593f376d216 |
| 1 | Thai legacy | 27.938569 | -82.28576 | Thai Restaurant | 5171cd8a498e616e54b37189 |
| 2 | Babe's Pizza | 27.938668 | -82.293586 | Pizza Place | 4b7876ebf964a52026d02ee3 |
| 3 | Taste Of Berlin | 27.934251 | -82.291767 | German Restaurant | 4e8f920f550342b5a3e1cce1 |
| 4 | Pho Viet | 27.938097 | -82.301004 | Vietnamese Restaurant | 5261c02811d233a4c6d5e0b9 |


In [131]:
for ind in np.arange(nearby_venues.shape[0]):
    url = 'https://api.foursquare.com/v2/venues/{}?client_id={}&client_secret={}&v={}'.format(
            nearby_venues.at[ind, 'Venue_ID'], 
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION )
    target_file = local_frsq_dir + nearby_venues.at[ind, 'Venue_ID'] + '.json'
    
    # use if relying on pre-downloaded venue details:
    # download the details once, then reuse the local copy on later runs
    if not os.path.exists(target_file):
        urllib.request.urlretrieve(url, target_file)
    with open(target_file) as json_file:
        data = json.load(json_file) # a file object has no .json() method; parse it with json.load
    nearby_venues.at[ind, 'Likes'] = data["response"]['venue']['likes']['count']
    # fall back to NaN in case a venue has no rating field
    nearby_venues.at[ind, 'Rating'] = data["response"]['venue'].get('rating', np.nan)
    
    #use if downloading venue details directly
    #data = requests.get(url).json() 
    #nearby_venues.at[ind, 'Likes'] = data["response"]['venue']['likes']['count']
    #nearby_venues.at[ind, 'Rating'] = data["response"]['venue']['rating']

In [102]:
print(nearby_venues.shape)
nearby_venues.head()

(100, 7)


|   | Venue | Venue_Latitude | Venue_Longitude | Venue_Category | Venue_ID | Likes | Rating |
|---|---|---|---|---|---|---|---|
| 0 | Moreno Bakery | 27.936403 | -82.295102 | Bakery | 4c051afef423a593f376d216 | 140.0 | 9.2 |
| 1 | Thai legacy | 27.938569 | -82.28576 | Thai Restaurant | 5171cd8a498e616e54b37189 | 8.0 | 8.4 |
| 2 | Babe's Pizza | 27.938668 | -82.293586 | Pizza Place | 4b7876ebf964a52026d02ee3 | 34.0 | 8.8 |
| 3 | Taste Of Berlin | 27.934251 | -82.291767 | German Restaurant | 4e8f920f550342b5a3e1cce1 | 23.0 | 8.5 |
| 4 | Pho Viet | 27.938097 | -82.301004 | Vietnamese Restaurant | 5261c02811d233a4c6d5e0b9 | 25.0 | 8.7 |
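
With likes and ratings attached to each venue, the modeling step of the plan could start from something like the sketch below. It fits a scikit-learn decision tree regressor on the five venues shown above. The `prox_a` and `prox_b` columns are synthetic placeholders standing in for the subpopulation proximities, which have not been computed at this point, so the fitted model only illustrates the workflow and says nothing about real venues.

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Categories and likes are copied from the output above;
# the prox_* values are made-up placeholders, not computed proximities.
venues = pd.DataFrame({
    "Venue_Category": ["Bakery", "Thai Restaurant", "Pizza Place",
                       "German Restaurant", "Vietnamese Restaurant"],
    "prox_a": [1.2, 0.4, 1.0, 0.9, 1.6],
    "prox_b": [2.1, 2.5, 2.0, 2.2, 1.8],
    "Likes": [140.0, 8.0, 34.0, 23.0, 25.0],
})

# One-hot encode the category so the tree can split on it.
X = pd.get_dummies(venues[["Venue_Category", "prox_a", "prox_b"]],
                   columns=["Venue_Category"])
y = venues["Likes"]

model = DecisionTreeRegressor(max_depth=3, random_state=0)
model.fit(X, y)

# Predict for a hypothetical new bakery; reindex aligns its columns
# with the training frame and fills unseen categories with 0.
new_venue = pd.DataFrame({"Venue_Category": ["Bakery"],
                          "prox_a": [1.1], "prox_b": [2.0]})
X_new = pd.get_dummies(new_venue, columns=["Venue_Category"]).reindex(
    columns=X.columns, fill_value=0)
predicted_likes = model.predict(X_new)
print(predicted_likes)
```

The same frame could be fit against `Rating` by swapping the target column, and a linear regression could be substituted for the tree without changing the feature preparation.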
