# World Financial Centers' Venues Data Mining

- Combines and deduplicates geospatial neighborhood data
- Requests nearby Foursquare venues and venue categories to build a profile of each neighborhood
- Required inputs
  - `cities_ratings.csv`
- Outputs
  - `world_neighborhood_coords.csv`

## Table of Contents

1. [Compile Global Neighborhood Coordinates](#clean)
    
2. [Fetch Foursquare Categories](#foursquare-categories)
    
3. [Fetch Foursquare Venues](#foursquare)

4. [Prepare model input features](#model-input)

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

from foursquare import fetch_city_venues, fetch_venue_categories, venue_frequency, rank_venues_by_frequency
from geocoder import enrich_neighborhoods_with_geocoder, map_neighborhoods, map_clusters



## Compile global neighborhood coordinates
<a id="clean"/>

In [2]:
ratings = pd.read_csv('data/cities_ratings.csv', index_col=0)
ratings.head()

Rank,Centre,Rating
1,New York City,790
2,London,773
3,Hong Kong,771
4,Singapore,762
5,Shanghai,761


## Fetch venue categories from Foursquare
<a id='foursquare-categories' />

In [None]:
categories = fetch_venue_categories()
categories

In [None]:
c = pd.DataFrame(categories)
c

In [None]:
# c.to_csv('data/foursquare_venues.csv')
c = pd.read_csv('data/foursquare_venues.csv', index_col=0)

c.set_index('name', drop=True, inplace=True)
c

In [None]:
categories = list(c.category.value_counts().index)
categories

## Fetch neighborhood venues from Foursquare
<a id='foursquare' />

In [3]:
world_venues = fetch_city_venues(cities=ratings[0:2].Centre)  # first two centres: New York City and London
world_venues

https://api.foursquare.com/v2/venues/explore?&client_id=5YWYDD1OYF3I3QYJMUFJX3O1UDV5MU4NCBI30FYK4YHMGWQ4&client_secret=LFUHV1IDE4KESRNAONHO1YX42CCC5GICZ112WKKL3HHJ2PGU&v=20180605&near=New York City&intent=browse
https://api.foursquare.com/v2/venues/explore?&client_id=5YWYDD1OYF3I3QYJMUFJX3O1UDV5MU4NCBI30FYK4YHMGWQ4&client_secret=LFUHV1IDE4KESRNAONHO1YX42CCC5GICZ112WKKL3HHJ2PGU&v=20180605&near=London&intent=browse


Unnamed: 0,City,Venue,Venue Category,Venue Latitude,Venue Longitude
0,New York City,Central Park,Park,40.784083,-73.964853
1,New York City,Brooklyn Bridge Park,Park,40.702282,-73.996456
2,New York City,Gantry Plaza State Park,State / Provincial Park,40.746558,-73.958051
3,New York City,Minskoff Theatre,Theater,40.757389,-73.985537
4,New York City,Hudson River Greenway Running Path,Trail,40.732552,-74.01058
5,New York City,Conservatory Garden,Garden,40.793531,-73.952032
6,New York City,Bryant Park,Park,40.753621,-73.983265
7,New York City,Hudson River Park,Park,40.733869,-74.010454
8,New York City,The Metropolitan Museum of Art (Metropolitan M...,Art Museum,40.779729,-73.963416
9,New York City,Los Tacos No. 1,Taco Place,40.757237,-73.987454
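
The request URLs printed above carry the API credentials in the query string. As a minimal sketch (not the `foursquare` module's actual implementation), the same kind of URL could be assembled with the standard library and masked before printing. The helper names `build_explore_url` and `redact`, and the environment variable names, are assumptions made for illustration. Note that `urlencode` encodes spaces as `+`.

```python
import os
from urllib.parse import urlencode

EXPLORE_URL = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(city, client_id, client_secret, version="20180605"):
    """Build an explore request URL for one city (hypothetical helper)."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "near": city,
        "intent": "browse",
    }
    return f"{EXPLORE_URL}?{urlencode(params)}"


def redact(url, secrets):
    """Mask secret values so the URL is safe to print or log."""
    for secret in secrets:
        if secret:  # skip empty strings, which would match everywhere
            url = url.replace(secret, "***")
    return url


# Assumed environment variable names; placeholders are used if they are unset.
client_id = os.environ.get("FOURSQUARE_CLIENT_ID", "demo-id")
client_secret = os.environ.get("FOURSQUARE_CLIENT_SECRET", "demo-secret")

url = build_explore_url("New York City", client_id, client_secret)
print(redact(url, [client_id, client_secret]))
```

Keeping the credentials out of the notebook source and out of printed output means a saved or shared copy of the notebook does not expose them.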


In [None]:
world_venues[0:20]

In [None]:
world_venues[49:70]

In [None]:
world_venues.to_csv('data/nyc_london.csv')

In [None]:
world_venues_1 = fetch_city_venues(cities=ratings[2:50].Centre)  # rows 2-49: the centres after New York City and London
world_venues_1

In [None]:
world_venues_1.to_csv('data/hongkong_singapore.csv')

In [None]:
world_venues_1[world_venues_1.duplicated()]
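
The venues are fetched in two batches, and the duplicate check inspects one batch at a time. One possible way to combine batches and drop exact duplicate rows is sketched below. It assumes both batches share the same columns, and the toy rows are illustrative stand-ins rather than fetched data.

```python
import pandas as pd

# Toy stand-ins for two fetched batches (illustrative values only).
batch_a = pd.DataFrame({
    "City": ["New York City", "London"],
    "Venue": ["Central Park", "Hyde Park"],
    "Venue Category": ["Park", "Park"],
})
batch_b = pd.DataFrame({
    "City": ["London", "Hong Kong"],
    "Venue": ["Hyde Park", "Victoria Peak"],
    "Venue Category": ["Park", "Scenic Lookout"],
})

# Stack the batches, then drop rows that are identical in every column.
combined = pd.concat([batch_a, batch_b], ignore_index=True)
deduplicated = combined.drop_duplicates().reset_index(drop=True)

print(len(combined), len(deduplicated))
```

`drop_duplicates` keeps the first occurrence of each row by default, so the order of the batches decides which copy survives.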

In [None]:
world_venues_1[0:5]

In [None]:
world_venues_1[100:105]

In [None]:
world_venues['Venue Category'].value_counts()

In [None]:
world_venues.City.value_counts()

In [None]:
world_venues.to_csv('data/world_city_venues.csv')

# world_venues = pd.read_csv('data/world_neighborhood_venues.csv', index_col=0)
# world_venues

### Label Venues data with categories

In [None]:
s = c['category']

In [None]:
# Look up each venue's top-level category; Series.get returns None when there is no match
world_venues['Category'] = world_venues['Venue Category'].apply(lambda x: s.get(x))
world_venues
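
The labeling step is a lookup from each venue's specific category to a top-level category. A minimal sketch of the same idea with `Series.map` follows, assuming the lookup has unique index values. The lookup contents and the `Unknown Type` row are made up for illustration. Unmatched venues come back as `NaN`, which makes them easy to inspect.

```python
import pandas as pd

# Hypothetical lookup: specific venue category -> top-level category.
lookup = pd.Series({"Park": "Outdoors & Recreation", "Taco Place": "Food"})

venues = pd.DataFrame({"Venue Category": ["Park", "Taco Place", "Unknown Type"]})

# map() aligns each value with the lookup's index; misses become NaN.
venues["Category"] = venues["Venue Category"].map(lookup)

# Rows with no top-level category, worth checking before filtering.
unmatched = venues[venues["Category"].isna()]
print(unmatched)
```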

In [None]:
ax = world_venues['Category'].value_counts().plot(kind='barh')
ax.invert_yaxis()
plt.title('Total Foursquare Venue Categories Collected')
plt.show()

Since the bottom four categories each account for less than 1% of the data points, and the model's feature space is limited, we drop them.

### Filter out venue categories that account for < 1% of venues

In [None]:
len(world_venues)/100

In [None]:
above_venue_category_threshold = world_venues.groupby('Category').count() >= len(world_venues)/100
above_venue_category_threshold

In [None]:
# Keep only venues whose category holds at least 1% of all venues
frequent_categories = above_venue_category_threshold[above_venue_category_threshold['Venue']].index
filtered_venues = world_venues[world_venues['Category'].isin(frequent_categories)].reset_index(drop=True)
filtered_venues
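
The 1% rule can also be expressed directly as a share of rows. The sketch below is an equivalent formulation on toy data, not the notebook's own code, and the category names and counts are invented so the arithmetic is easy to follow: 1 row out of 200 is 0.5%, which falls under the threshold.

```python
import pandas as pd

# Toy data: 150 Food, 49 Shop, 1 Event (200 rows in total).
venues = pd.DataFrame({"Category": ["Food"] * 150 + ["Shop"] * 49 + ["Event"] * 1})

# Share of all rows held by each category.
share = venues["Category"].value_counts(normalize=True)

# Keep categories with at least a 1% share, then filter the rows.
kept = share[share >= 0.01].index
filtered = venues[venues["Category"].isin(kept)].reset_index(drop=True)

print(share)
print(len(filtered))
```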

### Filter out neighborhoods with Foursquare venues count < 10

In [None]:
above_venue_count_threshold = world_venues[['Neighborhood', 'City', 'Venue']].groupby(['Neighborhood', 'City']).count() >= 10
above_venue_count_threshold

In [None]:
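# Key each venue by its (Neighborhood, City) pair so it can be matched against the grouped index
world_venues['Address'] = list(zip(world_venues['Neighborhood'], world_venues['City']))
# Neighborhoods with at least 10 venues, taken from the mask computed in the previous cell
dense_neighborhoods = above_venue_count_threshold[above_venue_count_threshold['Venue']].index
filtered_venues = world_venues[world_venues['Address'].isin(dense_neighborhoods)].reset_index(drop=True)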
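
An alternative that avoids the helper `Address` column is to broadcast each group's venue count back onto its rows with `groupby(...).transform`. This is a sketch on invented data, assuming the same `Neighborhood`, `City`, and `Venue` column names as the cell above.

```python
import pandas as pd

# Toy data: neighborhood A has 12 venues, neighborhood B has 3.
venues = pd.DataFrame({
    "Neighborhood": ["A"] * 12 + ["B"] * 3,
    "City": ["X"] * 15,
    "Venue": [f"venue {i}" for i in range(15)],
})

# For every row, the number of venues in that row's (Neighborhood, City) group.
group_size = venues.groupby(["Neighborhood", "City"])["Venue"].transform("count")

# Keep rows that belong to neighborhoods with at least 10 venues.
filtered = venues[group_size >= 10].reset_index(drop=True)
print(filtered["Neighborhood"].value_counts())
```

Because `transform` returns a Series aligned with the original rows, the result can be used as a boolean mask without building tuple keys.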

In [None]:
filtered_venues

In [None]:
filtered_venues.to_csv('data/world_neighborhood_venues.csv')

In [None]:
len(filtered_venues)

In [None]:
ax = filtered_venues['Venue'].value_counts()[0:15].plot(kind='barh')
ax.invert_yaxis()
plt.title('Top 15 Foursquare Venues Collected (n=13,171)')
plt.show()

In [None]:
ax = filtered_venues['Venue Category'].value_counts()[0:15].plot(kind='barh')
ax.invert_yaxis()
plt.title('Top 15 Foursquare Venue Types Collected (n=13,171)')
plt.show()

In [None]:
filtered_venues['Category'].value_counts()

In [None]:
venues_by_neighborhood = filtered_venues.pivot_table(index=['Neighborhood', 'City'], columns='Category', values='Venue', aggfunc='count', fill_value=0)
venues_by_neighborhood

### Convert neighborhood venue type counts to frequencies

In [None]:
neighborhood_venue_frequency = venues_by_neighborhood.div(venues_by_neighborhood.sum(axis=1), axis=0)
neighborhood_venue_frequency
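
To make the normalization concrete, here is a small worked example on invented counts. Each row is divided by its own total, so every neighborhood's frequencies sum to 1.

```python
import pandas as pd

# Toy counts for two neighborhoods (illustrative values only).
counts = pd.DataFrame(
    {"Food": [6, 1], "Shop": [2, 3]},
    index=pd.MultiIndex.from_tuples(
        [("A", "X"), ("B", "Y")], names=["Neighborhood", "City"]
    ),
)

# Divide each row by its row total; axis=0 aligns the totals with the row index.
frequency = counts.div(counts.sum(axis=1), axis=0)
print(frequency)
```

For neighborhood A the row total is 8, so Food becomes 6/8 = 0.75 and Shop becomes 2/8 = 0.25.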

In [None]:
neighborhood_venue_frequency.to_csv('data/world_neighborhood_venues_frequency.csv')